Preface



We are very pleased to present the proceedings of the First Workshop on Algorithms
in Bioinformatics (WABI 2001), which took place in Aarhus on August 28-31, 2001,
under the auspices of the European Association for Theoretical Computer Science
(EATCS) and the Danish Center for Basic Research in Computer Science
(BRICS).

The Workshop on Algorithms in Bioinformatics covers research on all aspects 
of algorithmic work in bioinformatics. The emphasis is on discrete algorithms 
that address important problems in molecular biology. These are founded on 
sound models, are computationally efficient, and have been implemented and 
tested in simulations and on real datasets. The goal is to present recent research 
results, including significant work-in-progress, and to identify and explore direc- 
tions of future research. Specific topics of interest include, but are not limited 
to: 

— Exact and approximate algorithms for genomics, sequence analysis, gene and 
signal recognition, alignment, molecular evolution, structure determination 
or prediction, gene expression and gene networks, proteomics, functional 
genomics, and drug design. 

— Methods, software and dataset repositories for development and testing of 
such algorithms and their underlying models. 

— High-performance approaches to computationally hard problems in bioinfor- 
matics, particularly optimization problems. 

A major goal of the workshop is to bring together researchers spanning the 
range from abstract algorithm design to biological dataset analysis, to encourage 
dialogue between application specialists and algorithm designers, mediated by 
algorithm engineers and high-performance computing specialists. We believe that 
such a dialogue is necessary for the progress of computational biology, inasmuch 
as application specialists cannot analyze their datasets without fast and robust 
algorithms and, conversely, algorithm designers cannot produce useful algorithms 
without being aware of the problems faced by biologists. Part of this mix was 
achieved automatically this year by colocating into a single large conference, 
ALGO 2001, three workshops: WABI 2001, the 5th Workshop on Algorithm
Engineering (WAE 2001), and the 9th European Symposium on Algorithms (ESA
2001), and sharing keynote addresses among the three workshops. ESA attracts
algorithm designers, mostly with a theoretical leaning, while WAE is explicitly 
targeted at algorithm engineers and algorithm experimentalists. 

These proceedings reflect such a mix. We received over 50 submissions in 
response to our call and were able to accept 23 of them, ranging from mathe- 
matical tools through to experimental studies of approximation algorithms and 
reports on significant computational analyses. Numerous biological problems are 
dealt with, including genetic mapping, sequence alignment and sequence analy- 
sis, phylogeny, comparative genomics, and protein structure. 




We were also fortunate to attract Dr. Gene Myers, Vice-President for Informatics
Research at Celera Genomics, and Prof. Jotun Hein, Aarhus University,
to address the joint workshops, joining five other distinguished speakers (Profs. 
Herbert Edelsbrunner and Lars Arge from Duke University, Prof. Susanne Al- 
bers from Dortmund University, Prof. Uri Zwick from Tel Aviv University, and 
Dr. Andrei Broder from Alta Vista). The quality of the submissions and the
interest expressed in the workshop are promising; plans for next year’s workshop
are under way.

We would like to thank all the authors for submitting their work to the 
workshop and all the presenters and attendees for their participation. We were 
particularly fortunate in enlisting the help of a very distinguished panel of re- 
searchers for our program committee, which undoubtedly accounts for the large 
number of submissions and the high quality of the presentations. Our heartfelt 
thanks go to all: 

Craig Benham (Mt Sinai School of Medicine, New York, USA)

Mikhail Gelfand (Integrated Genomics, Moscow, Russia) 

Raffaele Giancarlo (U. di Palermo, Italy) 

Michael Hallett (McGill U., Canada)

Jotun Hein (Aarhus U., Denmark) 

Michael Hendy (Massey U., New Zealand) 

Inge Jonassen (Bergen U., Norway) 

Junhyong Kim (Yale U., New Haven, USA) 

Jens Lagergren (KTH Stockholm, Sweden) 

Edward Marcotte (U. Texas Austin, USA) 

Satoru Miyano (Tokyo U., Japan) 

Gene Myers (Celera Genomics, USA)

Marie-France Sagot (Institut Pasteur, France) 

David Sankoff (U. Montreal, Canada)

Thomas Schiex (INRA Toulouse, France) 

Joao Setubal (U. Campinas, Sao Paulo, Brazil)

Ron Shamir (Tel Aviv U., Israel) 

Lisa Vawter (GlaxoSmithKline, USA) 

Martin Vingron (Max Planck Inst. Berlin, Germany) 

Tandy Warnow (U. Texas Austin, USA) 

In addition, the opinion of several other researchers was solicited. These subreferees
include Tim Beissbarth, Vincent Berry, Benny Chor, Eivind Coward, Ingvar
Eidhammer, Thomas Faraut, Nicolas Galtier, Michel Goulard, Jacques van
Helden, Anja von Heydebreck, Ina Koch, Chaim Linhart, Hannes Luz, Vsevolod
Yu, Michal Ozery, Itsik Pe’er, Sven Rahmann, Katja Rateitschak, Eric Rivals,
Mikhail A. Roytberg, Roded Sharan, Jens Stoye, Dekel Tsur, and Jian Zhang.
We thank them all. 

Lastly, we thank Prof. Erik Meineche-Schmidt, BRICS codirector, who
started the entire enterprise by calling on one of us (Bernard Moret) to set up the
workshop and who led the team of committee chairs and organizers through the
setup, development, and actual events of the three combined workshops, with
the assistance of Prof. Gerth Brødal.

We hope that you will consider contributing to WABI 2002, through a sub- 
mission or by participating in the workshop. 

June 2001 Olivier Gascuel and Bernard M.E. Moret 
Abstract. The statistical approach to molecular sequence evolution involves the
stochastic modeling of the substitution, insertion and deletion processes. Substi- 
tution has been modeled in a reliable way for more than three decades by using 
finite Markov-processes. Insertion and deletion, however, seem to be more dif- 
ficult to model, and the recent approaches cannot acceptably deal with multiple 
insertions and deletions. A new method based on a generating function approach 
is introduced to describe the multiple insertion process. The presented algorithm 
computes the approximate joint probability of two sequences in $O(l^3)$ running
time, where $l$ is the geometric mean of the sequence lengths.



1 Introduction 

The traditional sequence analysis [1] needs proper evolutionary parameters. These pa- 
rameters depend on the actual divergence time, which is usually unknown as well. An- 
other major problem is that the evolutionary parameters cannot be estimated from a 
single alignment. Incorrectly determined parameters might cause unrecognizable bias 
in the sequence alignment. 

One way to break this vicious circle is the maximum likelihood parameter estima- 
tion. In the pioneering work of Bishop and Thompson [2], an approximate likelihood 
calculation was introduced. Several years later, Thorne, Kishino, and Felsenstein wrote
a landmark paper [3], in which they presented an improved maximum likelihood algo- 
rithm, which estimates the evolutionary distance between two sequences involving all 
possible alignments in the likelihood calculation. Their 1991 model (frequently referred 
to as the TKF91 model) considers only single insertions and deletions, but this assumption
is rather unrealistic [4,5]. The model was later improved by allowing longer
insertions and deletions [4]; this version is usually referred to as the TKF92 model.
However, this model assumes that sequences contain unbreakable fragments, and only 
whole fragments are inserted and deleted. As it was shown [4], the fragment model has 
a flaw: considering unbreakable fragments, there is no possible explanation for overlap- 
ping deletions with a scenario of just two events. This problem is solvable by assuming 
that the ancestral sequence was fragmented independently on both branches immedi- 
ately after the split, and sequences evolved since then according to the fragment model 
[6]. However, this assumption does not solve the problem completely: fragments do not 
have biological realism. The lack of biological realism is revealed when we want
to generalize this split model for multiple sequence comparison. For example, consider 
that we have proteins from humans, gorillas and chimps. When we want to analyze 
the three sequences simultaneously, two pairs of fragmentation are needed: one pair 
at the gorilla-(human and chimp) split and one at the human-chimp split. When only 
sequences from gorillas and humans are compared, the fragmentation at the human- 
chimp split is omitted. Thus, the description of the evolution of two sequences depends 
on the number of the introduced splits, and there is no sensible interpretation to this 
dependence. 



1.1 The Thorne-Kishino-Felsenstein Model 

Since our model is related to the TKF91 model, we describe it briefly. Most of the
definitions and notations are introduced here.

The TKF model is the fusion of two independent time-continuous Markov pro- 
cesses, the substitution and the insertion-deletion process. 



The Substitution Process: Each character can be substituted independently by another
character, as dictated by one of the well-known substitution processes [7], [8]. The
substitution process is described by a system of linear differential equations

$$\frac{d\mathbf{x}(t)}{dt} = Q\,\mathbf{x}(t) \tag{1}$$

where $Q$ is the rate matrix. Since $Q$ contains too many parameters, it is usually separated
into two components, $Q_0 s$, where $Q_0$ is kept constant and is estimated with a less
rigorous method than maximum likelihood [4]. The solution of (1) is

$$\mathbf{x}(t) = e^{Q_0 s t}\,\mathbf{x}(0) \tag{2}$$
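As an illustration of equation (2), the following minimal sketch computes the transition matrix $e^{Q_0 s t}$ in pure Python. The Jukes-Cantor rate matrix used for $Q_0$, the parameter `alpha`, and the helper names are assumptions chosen for the example; the text does not say which substitution process is used.

```python
import math

def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_exp(m, terms=30, squarings=10):
    """Matrix exponential via a truncated Taylor series with scaling and squaring."""
    n = len(m)
    scaled = [[x / 2 ** squarings for x in row] for row in m]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        # term becomes scaled^k / k!
        term = [[x / k for x in row] for row in mat_mul(term, scaled)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result

def jukes_cantor_q0(alpha=1.0):
    """Hypothetical choice of Q_0: every substitution has the same rate alpha."""
    return [[-3.0 * alpha if i == j else alpha for j in range(4)] for i in range(4)]

def transition_matrix(q0, s, t):
    """exp(Q_0 * s * t); only the product s*t enters the result."""
    return mat_exp([[x * s * t for x in row] for row in q0])
```

For the Jukes-Cantor choice the result can be compared with the known closed form, in which each diagonal entry equals $1/4 + (3/4)e^{-4\alpha s t}$.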



The Insertion-Deletion Process: The insertion-deletion process is traditionally described
not in terms of amino acids or nucleotides but in terms of imaginary links. A
mortal link is associated to the right of each character, and additionally, there is an immortal
link at the left end of the sequence. Each link can give birth to a mortal link with
birth rate $\lambda$. The newborn link always appears at the right side of its parent. The birth
of a mortal link is accompanied by the birth of a character drawn from the equilibrium
distribution. Only mortal links can die out, with death rate $\mu$, taking the character to
their left with them. Assuming independence between links, it is sufficient to describe
the fate of a single mortal link and of the immortal one. According to the possible histories
of links (Figure 1), three types of functions are considered. Let $p_k^{(1)}(t)$ denote the
probability that after time $t$, a mortal link has survived and has exactly $k$ descendants,
including itself. Let $p_k^{(2)}(t)$ denote the probability that after time $t$, a mortal link has died
but has left exactly $k$ descendants. Let $p_k(t)$ denote the probability that after time $t$, the
immortal link has exactly $k$ descendants, including itself.
Fig. 1. The Possible Fates of Links. The second column shows the fate of the immortal
link (o). After a time period t it has k descendants including itself. The third column 
describes the fate of a survived mortal link (*). It has k descendants including itself 
after time t. The fourth column depicts the fate of a mortal link that died, but left k 
descendants after time t. 



Calculating the Joint Probability of Two Sequences: The joint probability of two 
sequences A and B is calculated as the equilibrium probability of sequence A times the 
probability that sequence B evolved from A under time 2t, where t is the divergence 
time. 

$$P(A,B) = P_\infty(A)\,P_{2t}(B \mid A) \tag{3}$$

A possible transition is described as an alignment. The upper sequence is the ancestor;
the lower sequence is the descendant. For example, the following alignment describes
that the immortal link o has one descendant, the first mortal link * died out, and the
second mortal link has two descendants including itself.

o  -   A*  U*  -

o  G*  -   C*  A*



The probability of an alignment is the probability of the ancestor times the probability
of the transition. For example, the probability of the above alignment is

$$\gamma_2\,\pi(A)\,\pi(U)\,p_2(t)\,\pi(G)\,p_0^{(2)}(t)\,p_2^{(1)}(t)\,f_{UC}(2t)\,\pi(A) \tag{4}$$

where $\gamma_n$ is the probability that a sequence contains $n$ mortal links, $\pi(X)$ is the frequency
of the character $X$, and $f_{ij}(2t)$ is the probability that a character $i$ has been replaced by $j$ at
time $2t$. The joint probability of two sequences is the summation of the alignment probabilities.

2 The Model 

Our model differs from the TKF models in the insertion-deletion process. The TKF91
model assumes only single insertions and deletions, as illustrated in Figure 2. Long
insertions and deletions are allowed in the TKF92 model, as illustrated in Figure 3.
However, these long indels are considered as unbreakable fragments, as they have only
one common mortal link. The death of the mortal link causes the deletion of every character
in the long insertion. The distinction from the previous model is that in our model
Fig. 2. The Flowchart of the TKF91 Model. Each link can give birth to a mortal link
with birth rate $\lambda > 0$. Mortal links die with death rate $\mu > 0$.



Fig. 3. The Flowchart of the Thorne-Kishino-Felsenstein Fragment Model. A link can
give birth to a fragment of length $k$ with birth rate $\lambda r(1-r)^{k-1}$, with $\lambda > 0$ and
$0 < r < 1$. Fragments are unbreakable, so that only whole fragments can die, with death
rate $\mu > 0$.



Fig. 4. The Flowchart of Our Model. Each link can give birth to $k$ mortal links with birth
rate $\lambda r(1-r)^{k-1}$, with $\lambda > 0$ and $0 < r < 1$. Each newborn link can die independently
with death rate $\mu > 0$.



every character has its own mortal link in the long insertions, as illustrated in Figure 4.
Thus, this model allows long insertions without considering unbreakable fragments. It
is possible that a long fragment is inserted into the sequence first, and afterwards some of the
inserted links die while others survive. A link gives birth to a block of $k$
mortal links with rate $\lambda_k$, where

$$\lambda_k = \lambda r (1-r)^{k-1}, \quad k = 1, 2, \ldots, \quad \lambda > 0, \quad 0 < r < 1 \tag{5}$$

Only mortal links can die, with rate $\mu > 0$.
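The geometric form of the block rates has two simple consequences that can be checked numerically: the rates sum to $\lambda$, and the mean inserted block length is $1/r$. The sketch below illustrates this; the function names and the truncation bound `k_max` are assumptions made for the example.

```python
def block_rate(k, lam, r):
    """Rate at which one link gives birth to a block of k mortal links."""
    return lam * r * (1.0 - r) ** (k - 1)

def total_birth_rate(lam, r, k_max=2000):
    """Sum of the block rates over k = 1..k_max (approaches lam)."""
    return sum(block_rate(k, lam, r) for k in range(1, k_max + 1))

def mean_block_length(lam, r, k_max=2000):
    """Rate-weighted mean block length (approaches 1/r)."""
    weighted = sum(k * block_rate(k, lam, r) for k in range(1, k_max + 1))
    return weighted / total_birth_rate(lam, r, k_max)
```

With r close to 1 almost all insertions are single characters, which is the regime of single-character insertions described for the TKF91 model.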
2.1 Calculating the Generating Functions

The Master Equation: First, the probabilities of the possible fates of the immortal
link are computed. Collecting the gain and loss terms for this birth-death process, the
following Master equation is obtained:

$$\frac{dp_n}{dt} = \sum_{j=1}^{n-1} (n-j)\lambda_j\,p_{n-j} + n\mu\,p_{n+1} - \left( n\sum_{j=1}^{\infty}\lambda_j + (n-1)\mu \right) p_n \tag{6}$$

Using $\sum_{j=1}^{\infty}\lambda_j = \lambda$ and $\sum_{j=1}^{n-1}(n-j)\lambda_j\,p_{n-j} = \sum_{k=1}^{n-1} k\lambda_{n-k}\,p_k$, we have:

$$\frac{dp_n}{dt} = \lambda r \sum_{k=1}^{n-1} k(1-r)^{n-k-1}\,p_k + n\mu\,p_{n+1} - \left(n\lambda + (n-1)\mu\right)p_n \tag{7}$$

Due to the immortal link, we have $\forall t,\ p_0(t) = 0$. For $n = 1$, the sum in (7) is void. The
initial conditions are given by:

$$p_n(0) = \delta_{n,1} \tag{8}$$

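Next, we introduce the generating function [9]:

$$P(\xi; t) = \sum_{n=0}^{\infty} \xi^n p_n(t) \tag{9}$$

Multiplying (7) by $\xi^n$, then summing over $n$, we obtain a linear PDE for the generating
function:

$$\frac{\partial P}{\partial t} - (1-\xi)\left(\mu - \frac{\lambda\xi}{1-\xi(1-r)}\right)\frac{\partial P}{\partial \xi} = -\mu\,\frac{1-\xi}{\xi}\,P \tag{10}$$

with initial condition $P(\xi; 0) = \xi$.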

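Solution to the PDE for the Generating Function: We use the method of Lagrange:

$$\frac{dt}{1} = \frac{d\xi}{-(1-\xi)\left(\mu - \frac{\lambda\xi}{1-\xi(1-r)}\right)} = \frac{dP}{-\mu\,\frac{1-\xi}{\xi}\,P} \tag{11}$$

The two equalities define two one-parameter families of surfaces, namely $v(\xi; t; P)$
and $w(\xi; t; P)$. After integrating the first and the second equalities in (11), the following
families of surfaces are obtained:

$$v(\xi; t; P) = \frac{(1-\xi)^r}{(\mu - a\xi)^{\lambda/a}}\,e^{-(\mu r - \lambda)t} = c_1 \tag{12}$$

$$w(\xi; t; P) = P\,\frac{(\mu - a\xi)^{\lambda/a}}{\xi} = c_2 \tag{13}$$

with $a = \lambda + \mu(1-r) > 0$. The general form of the solution is an arbitrary function of
the form $w = g(v)$. This means:

$$P(\xi; t) = \frac{\xi}{(\mu - a\xi)^{\lambda/a}}\; g\!\left(\frac{(1-\xi)^r}{(\mu - a\xi)^{\lambda/a}}\,e^{-(\mu r - \lambda)t}\right) \tag{14}$$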
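The function $g$ is fixed from the initial condition $P(\xi; 0) = \xi$:

$$g(z) = \left(\mu - a f^{-1}(z)\right)^{\lambda/a} \tag{15}$$

where

$$f(x) = \frac{(1-x)^r}{(\mu - a x)^{\lambda/a}} \tag{16}$$

Thus the exact form for the generating function becomes:

$$P(\xi; t) = \xi \left( \frac{\mu - a f^{-1}\!\left(f(\xi)\,e^{-(\mu r - \lambda)t}\right)}{\mu - a\xi} \right)^{\lambda/a} \tag{17}$$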

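The Probabilities for the Fate of the Mortal Links: The Master Equations for the
probabilities $p_n^{(1)}(t)$ and $p_n^{(2)}(t)$ are given by

$$\frac{dp_n^{(1)}}{dt} = \sum_{j=1}^{n-1} (n-j)\lambda_j\,p_{n-j}^{(1)} + n\mu\,p_{n+1}^{(1)} - \left( n\sum_{j=1}^{\infty}\lambda_j + n\mu \right) p_n^{(1)} \tag{18}$$

$$\frac{dp_n^{(2)}}{dt} = \sum_{j=1}^{n-1} (n-j)\lambda_j\,p_{n-j}^{(2)} + (n+1)\mu\,p_{n+1}^{(2)} + \mu\,p_{n+1}^{(1)} - \left( n\sum_{j=1}^{\infty}\lambda_j + n\mu \right) p_n^{(2)} \tag{19}$$

We have the following conditions to be fulfilled:

$$\forall t \ge 0, \quad p_0^{(1)}(t) = 0 \tag{20}$$

and the initial conditions:

$$\forall n \ge 0, \quad p_n^{(1)}(0) = \delta_{n,1}, \quad p_n^{(2)}(0) = 0 \tag{21}$$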

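The corresponding partial differential equations for the generating functions,
$P^{(i)}(\xi; t) = \sum_{n=0}^{\infty} \xi^n p_n^{(i)}(t)$, for $i = 1, 2$, are given by

$$\frac{\partial P^{(1)}}{\partial t} - (1-\xi)\left(\mu - \frac{\lambda\xi}{1-\xi(1-r)}\right)\frac{\partial P^{(1)}}{\partial \xi} = -\frac{\mu}{\xi}\,P^{(1)} \tag{22}$$

$$\frac{\partial P^{(2)}}{\partial t} - (1-\xi)\left(\mu - \frac{\lambda\xi}{1-\xi(1-r)}\right)\frac{\partial P^{(2)}}{\partial \xi} = \frac{\mu}{\xi}\,P^{(1)} \tag{23}$$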

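Solution to the PDEs for the Generating Functions of the Mortal Links: First, we
solve (22) using the method of Lagrange:

$$\frac{dt}{1} = \frac{d\xi}{-(1-\xi)\left(\mu - \frac{\lambda\xi}{1-\xi(1-r)}\right)} = \frac{dP^{(1)}}{-\frac{\mu}{\xi}\,P^{(1)}} \tag{24}$$

The two one-parameter families of surfaces are $v(\xi; t; P^{(1)})$ and $w(\xi; t; P^{(1)})$. Since $v$
comes from the integration of the first equality in (24), it is the same as (12). Integrating
the second equality yields:

$$w(\xi; t; P^{(1)}) = P^{(1)}\,\frac{(1-\xi)^{\mu r/(\mu r - \lambda)}}{\xi\,(\mu - a\xi)^{\lambda/(\mu r - \lambda)}} = c_2 \tag{25}$$

Proceeding as in the previous section, we have:

$$P^{(1)}(\xi; t) = e^{-\mu t}\,\xi \left( \frac{\mu - a f^{-1}\!\left(f(\xi)\,e^{-(\mu r - \lambda)t}\right)}{\mu - a\xi} \right)^{\lambda/a} \tag{26}$$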

with $f$ given by (16). To calculate $P^{(2)}(\xi; t)$, we first define $Q(\xi; t) = P^{(1)}(\xi; t) +
P^{(2)}(\xi; t)$. Summing (22) and (23), the following equation is obtained for $Q$:

$$\frac{\partial Q}{\partial t} - (1-\xi)\left(\mu - \frac{\lambda\xi}{1-\xi(1-r)}\right)\frac{\partial Q}{\partial \xi} = 0 \tag{27}$$

This is again easily solved with the method of characteristics. First, we integrate the
characteristic equation, which is the first equation in (24), to obtain the family of characteristic
curves, given by $v(\xi; t) = c_1$ as in (12). Thus, $Q(\xi; t) = g(v)$ is the general
solution, where $g(x)$ is an arbitrary, differentiable function, to be set by the initial conditions.
Using (20) and (21), we have $Q(\xi; 0) = \xi$. This leads to:

$$Q(\xi; t) = f^{-1}\!\left(f(\xi)\,e^{-(\mu r - \lambda)t}\right) \tag{28}$$

with $f$ given by (16), and therefore:

$$P^{(2)}(\xi; t) = f^{-1}\!\left(f(\xi)\,e^{-(\mu r - \lambda)t}\right) - P^{(1)}(\xi; t) \tag{29}$$

with $P^{(1)}(\xi; t)$ given by (26).



2.2 The Equilibrium Length Distribution

The generating function of the equilibrium length distribution can be obtained from (17)
by considering the limit $t \to \infty$. Since $f^{-1}(0) = 1$ and due to the immortal link, the
generating function becomes

$$P(\xi) = \xi \left( \frac{\mu - a}{\mu - a\xi} \right)^{\lambda/a} \tag{30}$$

Calculating the Taylor series of $P(\xi)$ around 0, we get for the equilibrium probabilities:

$$\gamma_n = (\mu - a)^{\lambda/a}\,\frac{\prod_{i=0}^{n-1}(\lambda + i a)}{n!\,\mu^{\,n + \lambda/a}} \tag{31}$$

From $dP(\xi)/d\xi$ in the limit of $\xi \to 1$, the expected value of the sequence length is obtained
as:

$$E(l) = \frac{\lambda}{\mu r - \lambda} \tag{32}$$
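The equilibrium length distribution can be checked numerically. The sketch below assumes the negative-binomial form $\gamma_n = \left(\frac{\mu - a}{\mu}\right)^{\lambda/a} \prod_{i=0}^{n-1}(\lambda + ia) / (n!\,\mu^n)$ with $a = \lambda + \mu(1-r)$, which is the Taylor expansion of $\left(\frac{\mu - a}{\mu - a\xi}\right)^{\lambda/a}$, and it requires $\mu r > \lambda$ so that the distribution is normalizable. The function name and the truncation bound `n_max` are assumptions made for the example.

```python
def equilibrium_length_distribution(lam, mu, r, n_max):
    """Return [gamma_0, ..., gamma_n_max], the probabilities of n mortal links."""
    if mu * r <= lam:
        raise ValueError("an equilibrium exists only if mu * r > lam")
    a = lam + mu * (1.0 - r)
    gamma = [((mu - a) / mu) ** (lam / a)]
    for n in range(1, n_max + 1):
        # Ratio of consecutive terms: (lam + (n - 1) * a) / (n * mu).
        gamma.append(gamma[-1] * (lam + (n - 1) * a) / (n * mu))
    return gamma

def expected_length(gamma):
    """Mean number of mortal links under a (truncated) length distribution."""
    return sum(n * g for n, g in enumerate(gamma))
```

With a sufficiently large `n_max`, the probabilities sum to one and the mean agrees with $\lambda / (\mu r - \lambda)$.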
3 The Algorithm



3.1 Calculating the Transition Probabilities

Unfortunately, the inverse of $f$ given by (16) does not have a closed form. Thus a
numerical approach is needed for calculating the transition probability functions $p_n(t)$,
$p_n^{(1)}(t)$, and $p_n^{(2)}(t)$. We calculate the generating functions $P(\xi; t)$, $P^{(1)}(\xi; t)$ and
$P^{(2)}(\xi; t)$ in $l_1 + 1$ points around $\xi = 0$, where $l_1$ is the length of the shorter sequence.
For doing this, the following equation must be solved for $x$ numerically, where $\mu$, $\lambda$,
$r$, $t$, and $a$ are given.

$$\frac{(1-x)^r}{(\mu - a x)^{\lambda/a}} = \frac{(1-\xi)^r}{(\mu - a\xi)^{\lambda/a}}\,e^{-(\mu r - \lambda)t} \tag{33}$$

Given $l_1 + 1$ points, the functions are partially derived $l_1$ times. After this

$$p_n(t) = \left.\frac{\partial^n P(\xi, t)}{\partial \xi^n}\right|_{\xi = 0} \frac{1}{n!} \tag{34}$$

and similarly for $p_n^{(1)}(t)$ and $p_n^{(2)}(t)$. Thus, the transition probability functions can
be calculated in 0{P) time. 
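The numerical inversion step can be sketched as follows. The sketch assumes $f(x) = (1-x)^r / (\mu - ax)^{\lambda/a}$ with $a = \lambda + \mu(1-r)$, the immortal-link generating function $P(\xi; t) = \xi \left[ (\mu - a x^*) / (\mu - a\xi) \right]^{\lambda/a}$ with $x^* = f^{-1}\!\left(f(\xi)\,e^{-(\mu r - \lambda)t}\right)$, and $\mu r > \lambda$, in which case $f$ is decreasing on $[\xi, 1]$. Bisection is one possible root finder chosen for the example; the text does not say which numerical method is used.

```python
import math

def f(x, lam, mu, r):
    """f(x) = (1 - x)^r / (mu - a x)^(lam / a), with a = lam + mu (1 - r)."""
    a = lam + mu * (1.0 - r)
    return (1.0 - x) ** r / (mu - a * x) ** (lam / a)

def f_inverse(z, lower, lam, mu, r, iterations=200):
    """Solve f(x) = z on [lower, 1] by bisection; f is decreasing there if mu*r > lam."""
    lo, hi = lower, 1.0
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if f(mid, lam, mu, r) > z:
            lo = mid  # f is still too large, so the root lies to the right
        else:
            hi = mid
    return 0.5 * (lo + hi)

def immortal_generating_function(xi, t, lam, mu, r):
    """Evaluate P(xi; t) for the immortal link at a single point xi <= 1."""
    a = lam + mu * (1.0 - r)
    z = f(xi, lam, mu, r) * math.exp(-(mu * r - lam) * t)
    x = f_inverse(z, xi, lam, mu, r)
    return xi * ((mu - a * x) / (mu - a * xi)) ** (lam / a)
```

Evaluating this function at several points near $\xi = 0$ yields the samples from which the Taylor coefficients $p_n(t)$ would then be estimated.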



3.2 Dynamic Programming for the Joint Probability 

Without loss of generality we can suppose that the shorter sequence is sequence $B$. The
equilibrium probability of sequence $A$ is

$$P_\infty(A) = (\mu - a)^{\lambda/a}\,\frac{\prod_{i=0}^{l(A)-1}(\lambda + i a)}{l(A)!\,\mu^{\,l(A) + \lambda/a}}\,\prod_{i=1}^{l(A)} \pi(a_i) \tag{35}$$

where $a_i$ is the $i$th character in $A$ and $l(A)$ is the length of the sequence.

Let $A_i$ denote the $i$-long prefix of $A$ and let $B_j$ denote the $j$-long prefix of $B$.
There is a dynamic programming algorithm for calculating the transition probabilities
$P_t(A_i \mid B_j)$. The initial conditions are given by:

$$P_t(A_0 \mid B_j) = p_{j+1}(t) \prod_{k=1}^{j} \pi(b_k) \tag{36}$$

To save computation time, we calculate $\prod_{k=l}^{j} \pi(b_k)$ for every $l < j$ before the recursion.
Then the recursion follows

$$P_t(A_i \mid B_j) = \sum_{l=0}^{j} P_t(A_{i-1} \mid B_l)\,p_{j-l}^{(2)}(t) \prod_{k=l+1}^{j} \pi(b_k) + \sum_{l=0}^{j-1} P_t(A_{i-1} \mid B_l)\,p_{j-l}^{(1)}(t)\,f_{a_i b_{l+1}} \prod_{k=l+2}^{j} \pi(b_k) \tag{37}$$

The dynamic programming is the most time-consuming part of the algorithm; it takes
$O(l^3)$ running time.
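The recursion can be sketched in Python as below. This is a minimal illustration, not the authors' implementation: the function name, the list-based inputs `p`, `p1`, `p2` (standing for $p_n(t)$, $p_n^{(1)}(t)$, $p_n^{(2)}(t)$ indexed by $n$), the dictionary `pi` of character frequencies, and the nested dictionary `f` of substitution probabilities are all assumptions made for the example. The indexing assumes that the immortal link contributes $p_{j+1}$ for $j$ inserted characters, that a dead link leaves $j - l$ inserted characters, and that a surviving link is substituted to $b_{l+1}$ followed by inserted characters.

```python
def transition_probability(A, B, p, p1, p2, pi, f):
    """Probability that ancestor A evolves into descendant B, summed over alignments.

    p must have at least len(B) + 2 entries; p1 and p2 at least len(B) + 1.
    """
    m, n = len(A), len(B)
    # prod[l][j] = pi(b_{l+1}) * ... * pi(b_j), precomputed before the recursion.
    prod = [[1.0] * (n + 1) for _ in range(n + 1)]
    for l in range(n + 1):
        for j in range(l + 1, n + 1):
            prod[l][j] = prod[l][j - 1] * pi[B[j - 1]]
    T = [[0.0] * (n + 1) for _ in range(m + 1)]
    for j in range(n + 1):
        # The immortal link alone produces the first j characters of B.
        T[0][j] = p[j + 1] * prod[0][j]
    for i in range(1, m + 1):
        for j in range(n + 1):
            total = 0.0
            for l in range(j + 1):
                # Link i died and left j - l inserted characters.
                total += T[i - 1][l] * p2[j - l] * prod[l][j]
            for l in range(j):
                # Link i survived: its character maps to b_{l+1}, the rest are insertions.
                total += (T[i - 1][l] * p1[j - l] * f[A[i - 1]][B[l]]
                          * prod[l + 1][j])
            T[i][j] = total
    return T[m][n]
```

The three nested loops over `i`, `j`, and `l` make the cubic cost of the recursion visible directly in the code.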
3.3 Finding the Maximum Likelihood Parameters

As mentioned earlier, the substitution process is described with only one parameter,
$st$. (A general phenomenon is that the time and rate parameters cannot be estimated
individually, only their product.) The insertion-deletion model is described with three
parameters, $\lambda t$, $\mu t$, and $r$, which, however, can be reduced to two if the following equation
is taken into consideration:

$$\frac{\lambda}{\mu r - \lambda} = \frac{l(A) + l(B)}{2}$$

namely, the mean of the sequence lengths is the maximum likelihood estimator for the
expected value of the length distribution.

The maximum likelihood values of the three remaining parameters can be obtained
using one of the well-known numerical methods (gradient method, etc.).



4 Discussion and Conclusions 

There is an increasing desire for statistical methods of sequence analysis in the bioinformatics
community. Statistical alignment provides sensitive homology testing [5],
which is better than the traditional, similarity-based methods [10]. The summation over
the possible alignments leads to a good evolutionary parameter estimation [3], while
the parameter estimation from a single alignment is doubtful [3,11].

Methods based on evolutionary models integrate the multiple alignment and the
evolutionary tree reconstruction. The generalization of the Thorne-Kishino-Felsenstein
model to an arbitrary number of sequences is straightforward [12,13]. A novel approach is
to treat the evolutionary models as HMMs. The TKF model fits into the concept of the
pair-HMM [14]. Similarly, the generalization to n sequences can be handled as a
multiple-HMM. Following this approach, one can sample alignments related to a tree, providing
an objective approximation to the multiple alignment problem [15]. Sampling pairwise
alignments and evolutionary parameters allows further investigations of the evolutionary
process [16].

The weak point of the statistical approach is the lack of an appropriate evolutionary
model. A new model and an associated algorithm for computing the joint probability
were introduced. This new model is superior to the Thorne-Kishino-Felsenstein model:
it allows long insertions without considering unbreakable fragments. However, it is only
a small step toward reality, as it contains at least two unrealistic properties. It cannot
deal with long deletions, and the rates for the long insertions form a geometric series.
The elimination of both these problems seems to be rather difficult but not impossible.
Other rate functions for long insertions lead to more difficult PDEs whose characteristic
equations may not be integrable without a rather involved computational overhead.
The same situation appears when long deletions are allowed. Moreover, in this case
calculating only the fates of the individual links is not sufficient. Thus, for achieving
more appropriate models, numerical calculations are needed at an earlier stage of the
procedure. Nevertheless, we hope that the generating function approach will open some
novel avenues for further research.
Abstract. Alignments of frequency profiles against frequency profiles have a
wide scope of applications in currently used bioinformatic analysis tools, ranging
from multiple alignment methods based on the progressive alignment approach
to the detection of structural similarities based on remote sequence homology. We
present the new log average scoring approach to calculating the score to be used
with alignment algorithms like dynamic programming and show that it significantly
outperforms the commonly used average scoring and dot product approaches
on a fold recognition benchmark. The score is also applicable to the problem of
aligning two multiple alignments, since every multiple alignment induces a frequency
profile.



1 Introduction 

The use of alignment algorithms for establishing protein homology relationships
has a long tradition in the field of bioinformatics. When first developed, these algorithms
aimed at assessing the homology of two protein sequences and at constructing their best
mapping onto each other in terms of homology. By extending these algorithms to align
sequences of amino acids not only to their counterparts but also to frequency profiles, which
was first proposed by Gribskov, it became feasible to analyse the relationship of a
single protein with a whole family of proteins described by the frequency profile. Based
on this idea, the PSI-Blast program was developed, which is among the best
known and most heavily used tools in computational biology. Recently, a further abstraction
has proven to be of considerable use in protein structure prediction. In the CAFASP2
contest of fully automated protein structure prediction, the group of Rychlewski et al.
reached the second rank using a profile-profile alignment method called FFAS.
The notion of alignment is thus extended to provide a mapping between two protein
families represented by their frequency profiles. Rychlewski et al. used the dot product
to calculate the alignment score for a pair of profile vectors. In this paper we present a
new approach which allows one to choose an amino acid substitution model like the BLOSUM
model [12] and leads to a score that not only increases the ability to judge the
relatedness of two proteins by the alignment score but also has a meaning in terms of
the underlying substitution model.

We start by introducing the definition of profiles and subsequently discuss the three
candidate methods for scoring profile vectors against each other. In the second part,
the fold recognition experiments we performed are described and discussed. In the appendix,
further technical information on the benchmarks can be found.

2 Theory 

The use of profiles is to represent a set of related proteins by a statistical model that
does not increase in size even when the set of proteins gets large. This is done by making
a multiple alignment of all the sequences in the set and then counting the relative
frequencies of occurrence of each amino acid in each position of the multiple alignment.
Usually it is assumed that the underlying set of proteins is not known completely,
but that we have a small subset of representatives, for instance from a homology search
over a database. Extensive work has been done on the issue of estimating the “real” frequencies
of the full set from the sample retrieved by the homology search. Any of these
methods, like pseudo counts, Dirichlet mixture models, minimal-risk estimation,
or sequence weighting methods, may be used to preprocess the sample

to get the best estimation of the frequencies before one of the following scoring methods
is applied. In any case, the construction will yield a vector of probability vectors,
which in our setting are of dimension 20 (one entry for each amino acid). These probabilities
are positive real numbers that sum up to one and stand for the probability of seeing a
certain amino acid in this position of a multiple alignment of all family members. This
sequence of vectors will be called frequency profile or profile throughout the paper. The
gaps occurring in the multiple alignments are not accounted for in our models; therefore
the frequency vectors must eventually be scaled up to reach a total probability sum
of one. All of the profile-to-profile scoring methods introduced will be defined by a
formula which gives the corresponding score depending on two probability vectors (or
profile positions) named $\alpha$ and $\beta$.

2.1 Dot Product Scoring 

The simplest and fastest method is the dot product method as used by Rychlewski et
al. This is a rather heuristic approach, since a possible interpretation of the sum of
these scoring terms over all aligned positions remains unclear. The score is calculated
as

$$\mathrm{score}_{\text{dot product}}(\alpha, \beta) = \sum_{i=1}^{20} \alpha_i \beta_i$$

which is in fact the probability that identical amino acids are produced by drawing from
the distributions $\alpha$ and $\beta$ independently. The log of this score might therefore serve as
a meaningful measure of profile similarity, but this is not discussed here. As can be
seen, this scoring approach does not incorporate any knowledge about the similarities
between the amino acids and is therefore independent of any substitution matrix.
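A minimal sketch of the dot product score for one pair of profile positions follows. The function name and the representation of a profile position as a plain list of 20 probabilities are assumptions made for the example.

```python
def dot_product_score(alpha, beta):
    """Probability that independent draws from alpha and beta give the same amino acid."""
    if len(alpha) != len(beta):
        raise ValueError("profile vectors must have the same dimension")
    return sum(a * b for a, b in zip(alpha, beta))
```

For two uniform positions over 20 amino acids the score is 0.05, while two positions that are each concentrated on a different single amino acid score 0, however similar those amino acids may be chemically.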

2.2 Sequence-Sequence Alignment 

When aligning two amino acid sequences, the score is calculated as a likelihood ratio
between the likelihood that the alignment occurs between “related” sequences and the
likelihood that the alignment occurs between “unrelated” sequences. The notion of relatedness
is defined here by the employed substitution model, which incorporates two
probability distributions describing each case. The first distribution, called the null model,
describes the average case in which the two positions are each distributed like the amino
acid background and are unrelated, yielding $P(X = i, Y = j) = p_i p_j$. Here $p_k$ stands
for the probability of seeing an amino acid $k$ when randomly picking one amino acid
from an amino acid sequence database. The probability of seeing a pair of amino acids
in a “related” pair of sequences in corresponding positions has been estimated by some
authors using different methods. M. Dayhoff derived the distribution from observations
of single point mutations, resulting in the series of PAM matrices. In the case of the
BLOSUM matrix series [13], the distribution is derived from blocks of multiply aligned
sequences, which are clustered up to a certain amount of sequence identity. We introduce
an event called “related” for the case that the values of $X$ and $Y$ are related amino
acids and call the probability distribution $P(X = i, Y = j \mid \text{related}) = p_{\mathrm{rel}}(i, j)$. Using
this, we obtain the formula for the log odds score (“log” always standing for the natural
logarithm)

$$M(i,j) = \log\left(\frac{p_{\mathrm{rel}}(i,j)}{p_i\,p_j}\right) \tag{1}$$

which are the values stored in the substitution matrices except for a constant factor,
which is $10/\log 10$ in the Dayhoff models and $2/\log 2$ in the BLOSUM matrices.
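The log odds score can be illustrated on a toy example. The sketch below uses a hypothetical two-letter alphabet with made-up background and pair probabilities; none of these numbers come from a real substitution model, and the function name is an assumption.

```python
import math

def log_odds_matrix(p_rel, background):
    """M[i][j] = log(p_rel[i][j] / (p_i * p_j)), using the natural logarithm."""
    n = len(background)
    return [[math.log(p_rel[i][j] / (background[i] * background[j]))
             for j in range(n)] for i in range(n)]

# Toy model: the pair probabilities have the background as their marginals.
background = [0.6, 0.4]
p_rel = [[0.5, 0.1],
         [0.1, 0.3]]
M = log_odds_matrix(p_rel, background)
```

Pairs that are more frequent among related positions than expected by chance receive a positive score and the others a negative one; if `p_rel` equals the product of the backgrounds, every entry is zero.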

Using Bayes’ formula we get an interpretation of the likelihood ratio term defining
the log odds alignment score:

$$P(\text{related} \mid X = i, Y = j) = P(\text{related})\,\frac{P(X = i, Y = j \mid \text{related})}{P(X = i, Y = j)} \tag{2}$$

$$= P(\text{related})\,\frac{p_{\mathrm{rel}}(i,j)}{p_i\,p_j} \tag{3}$$

This means that except for the prior probability $P(\text{related})$, which is a constant, the
usual sequence-sequence alignment score is the log of the probability that the two amino
acids come from related positions, given the data.

If different positions are assumed to be independent of each other, the log odds 
score summed up over all aligned amino acid pairs is the log of the probability that 
the alignment occurs in case the sequences are related divided by the probability that 
the alignment occurs between unrelated sequences. It is therefore in a certain statis- 
tical sense the best means to decide whether the alignment maps related or unrelated 
sequences onto each other (Neyman-Pearson lemma). This quantity will be
maximised by the dynamic programming approach yielding the alignment that max- 
imises the likelihood ratio in favour of the “related” hypothesis. The gap parameters 
add penalties to this log-likelihood-ratio score, which indicate that the more gaps an 
alignment has, the more likely it is to occur between unrelated sequences rather than 
between related sequences. 

2.3 Average Scoring 

The average scoring method was the very first approach to scoring frequency profiles
against amino acid sequences na. The basic idea is that the score for a distribution
of amino acids is calculated by taking the expected value of the sequence-sequence
score under the profile distribution while keeping the amino acid from the sequence
fixed. This can be extended to profile-profile alignments in a straightforward fashion
and has been used in ClustalW [21]. There, two multiple alignments are aligned using
as score an average over all pairwise scores between residues, which is equivalent to
the average scoring approach as used here. The formula obtained this way is the
following:

$$\mathrm{score}_{\mathrm{average}}(\alpha,\beta) = \sum_{i=1}^{20}\sum_{j=1}^{20} \alpha_i\,\beta_j \log\frac{p_{\mathrm{rel}}(i,j)}{p_i\,p_j} \tag{4}$$



It can easily be shown that this score has an interpretation: Let N be a large integer
and let us take a sample of size N from the two profile positions (each sample being a
pair of amino acids, the distribution being (α_i β_j)_{i,j}). Then this score divides
the likelihood that the related distribution produced the sample by the likelihood that
the unrelated distribution produced the sample, takes the log and divides this by N.
The average score summed up over all aligned profile positions thus has the following
meaning: If we draw for each aligned pair of profile positions a sample of size N which
happens to show N α_i β_j times the amino acid pair (i, j), then the summed up average
score is the best means to decide whether this happens rather under the “related” or the
“unrelated” model.

The problem with this approach is that this is not the question we are asking. The
two distributions (“related” and “unrelated”) that are suggested as the only options are both
known to be inappropriate, since their marginal distributions (the distributions that are
obtained by fixing the first letter and allowing the second to take any value and vice
versa) are the background distribution of amino acids by the definition of the substitution
model. The appropriate setting for this model describes a situation in which each
profile position would in fact be occupied by a completely random amino acid (determined
by the background probability distribution), meaning that, if we drew more
and more amino acids from a position, then the observed distribution would have to
converge to the background amino acid distribution. This is not compatible with the
meaning usually associated with a profile vector, which is thought of as being itself the
limiting distribution to which the distribution of such a sample should converge.

Another drawback of this method is the fact that the special case of this formula,
when one of the profiles degenerates to a single sequence (at each position a probability
distribution which has probability one for a single amino acid), does not show the expected
behaviour of a good scoring system. This will be shown in the following section, where
we will extend the commonly used sequence-sequence score in a first step to the profile-sequence
setting such that a strict statistical interpretation of the score is at hand and
then further to the profile-profile setting, which will be evaluated further on.

2.4 Profile-Sequence Scoring 

The sequence-sequence scoring (1) can be extended to the profile-sequence case in a
straightforward manner. It has been noted several times (e.g. O) that, if
the target distribution α of amino acids in a profile position is known, the score given
by

$$\mathrm{score}(\alpha, j) = \log\frac{\alpha_j}{p_j} \tag{5}$$

yields an optimal test statistic by which to decide whether the amino acid j is a sample
from the distribution α or rather from the background distribution p. These values
summed up over all aligned positions therefore give a direct measure of how likely it is
that the amino acid sequence is a sample from the profile rather than being random. If
for a protein family only the corresponding profile is known, calculating this score is an
optimal way to decide whether an amino acid sequence is from this family or not. This
is a rather limited question to ask if we want to explore distant relationships. Therefore,
in our setting it is of interest whether the sequence is somehow evolutionarily related to
the family characterised by the profile or not.



Evolutionary Profile-Sequence Scoring. One method for evaluating this in the profile-sequence
case is the evolutionary profile method irrm, which only makes use of the
evolutionary model underlying the amino acid similarity matrices. The values
P(i → j) = p_rel(i, j)/p_i can, due to the construction of the similarity matrix, be interpreted as
the transition probabilities for a probabilistic transition (mutation) of the amino acid i
to j. From this point of view the value from (1) can be written as M(i, j) = log(P(i → j)/p_j),
which can be read as the likelihood ratio of amino acid j having occurred by
transition from amino acid i against j occurring just by chance. This can be extended
to the profile-sequence case by replacing i with the profile vector α and letting
the same probabilistic transition take place on a random amino acid with distribution
α instead of on the fixed amino acid i. The resulting probability of j occurring by
transition from an amino acid distributed like α is given by



$$\sum_{i=1}^{20} \alpha_i\,P(i \to j) = \sum_{i=1}^{20} \alpha_i\,\frac{p_{\mathrm{rel}}(i,j)}{p_i} \tag{6}$$



which leads to the score 



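$$\mathrm{score}(\alpha, j) = \log\frac{\sum_{i=1}^{20} \alpha_i\,P(i \to j)}{p_j} = \log\sum_{i=1}^{20} \alpha_i\,\frac{p_{\mathrm{rel}}(i,j)}{p_i\,p_j} \tag{7}$$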



This score summed up over all aligned positions in an alignment of a profile against 
a sequence is therefore an optimal means by which to decide whether the sequence is 
more likely the result of sampling from the profile which has undergone the probabilis- 
tic evolutionary transition or whether the sequence occurs just by chance (optimality in 
a statistical sense). 

It is apparent that the formula (7) is not a special case of the earlier introduced average
scoring (4). This is a drawback for the average scoring approach since it fails to yield an
intuitively correct result in a simple example: If the profile position is distributed like
the amino acid background distribution, i.e. α_i = p_i for all i, we would expect that we
have no information available on which to decide whether an amino acid j is related
to the profile position or not. Thus it is a desirable property of a scoring system that
any amino acid j should yield zero when scored against the background distribution.
This is the case for the evolutionary profile-sequence score but is not the case for the
average score, where we obtain (with p being the background distribution and e_j being
the j-th unit vector)

$$\mathrm{score}(\alpha, j) = \mathrm{score}_{\mathrm{average}}(p, e_j) = \sum_{i=1,\dots,20} p_i\,M(i,j)$$

which is never positive due to Jensen’s inequality (see e.g. [Q) and will always be neg- 
ative for the amino acid background distribution commonly observed. Thus the average 
score would propose that we have evidence against the hypothesis that the profile po- 
sition and the amino acid are related, which seems questionable. This is the motivation 
to look for a generalisation of the evolutionary sequence-profile scoring scheme to the 
profile-profile case. The results are explained in the following section which introduces 
the new scoring function proposed in this paper. 
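The background-distribution argument above can be checked numerically. The following is a minimal sketch, assuming a made-up three-letter alphabet with a hypothetical joint distribution `P_REL` (not BLOSUM or PAM data); the function names are illustrative only. It scores the background distribution against each residue with the average score and with the evolutionary profile-sequence score (7).

```python
import math

# Hypothetical toy model on a 3-letter alphabet (an assumption, not real data):
# P_REL[i][j] is the joint probability of the pair (i, j) in related positions.
P_REL = [
    [0.30, 0.05, 0.05],
    [0.05, 0.20, 0.05],
    [0.05, 0.05, 0.20],
]
# Background distribution p, taken as the marginal of P_REL.
P = [sum(row) for row in P_REL]


def odds(i, j):
    """Likelihood ratio p_rel(i, j) / (p_i * p_j) from equation (1)."""
    return P_REL[i][j] / (P[i] * P[j])


def average_profile_sequence_score(alpha, j):
    """Average scoring of a profile position alpha against residue j."""
    return sum(a * math.log(odds(i, j)) for i, a in enumerate(alpha))


def evolutionary_profile_sequence_score(alpha, j):
    """Evolutionary profile-sequence score of equation (7)."""
    return math.log(sum(a * odds(i, j) for i, a in enumerate(alpha)))


# Score the background distribution itself against every residue.
for j in range(len(P)):
    avg = average_profile_sequence_score(P, j)
    evo = evolutionary_profile_sequence_score(P, j)
    print(f"residue {j}: average = {avg:.4f}, evolutionary = {evo:.4f}")
```

In this toy model the evolutionary score vanishes up to rounding for every residue, while the average score is strictly negative, in line with the Jensen argument.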



2.5 Log Average Scoring 

Let again (X, Y) be a pair of random variables with values in {1, ..., 20} which represent
positions in profiles for which the question whether they are related is to be
answered. Since the goal here is to score profile positions against profile positions, we
have to incorporate into our model the fact that the particular X and Y we are observing
have the amino acid distributions (α_i)_{i=1,...,20} and (β_j)_{j=1,...,20}, respectively. This is
done by introducing an event E which has the following property:



$$P(X = i, Y = j \mid E) = \alpha_i\,\beta_j \tag{8}$$

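This leads to the equations

\begin{align}
P(\text{related} \mid E) &= \sum_{i=1}^{20}\sum_{j=1}^{20} P(X = i, Y = j, \text{related} \mid E) \tag{9} \\
&= \sum_{i=1}^{20}\sum_{j=1}^{20} P(X = i, Y = j \mid E)\,P(\text{related} \mid X = i, Y = j, E) \tag{10}
\end{align}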

Since a substitution model that directly addresses the case E with its special distributions
α and β is not available for the calculation of the last factor, we use the standard
model (see equation (3)) as an approximation instead and exploit the knowledge of the
amino acid distributions (see (8)) at the current profile positions for the first factor:



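\begin{align}
P(\text{related} \mid E) &\approx \sum_{i=1}^{20}\sum_{j=1}^{20} P(X = i, Y = j \mid E)\,P(\text{related} \mid X = i, Y = j) \tag{11} \\
&= P(\text{related}) \sum_{i=1}^{20}\sum_{j=1}^{20} \alpha_i\,\beta_j\,\frac{p_{\mathrm{rel}}(i,j)}{p_i\,p_j} \tag{12}
\end{align}

If the prior probability is set to 1 and the log is taken as in the usual sequence-sequence
score, we obtain the following formula for the log average score:

$$\mathrm{score}_{\mathrm{logaverage}}(\alpha,\beta) = \log\sum_{i=1}^{20}\sum_{j=1}^{20} \alpha_i\,\beta_j\,\frac{p_{\mathrm{rel}}(i,j)}{p_i\,p_j} \tag{13}$$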



It is interesting to note that the only difference between this formula and the average 
score is the exchanged order of the log and the sums. As can be seen this formula is 
an extension of the evolutionary profile score for the profile-sequence case with the 
advantages discussed above. If these scoring terms are summed up over all aligned 
positions in a profile-profile alignment the resulting alignment score is thus the log of 
the probability that the profiles are related under the substitution model given the data 
they provide (except for the prior). 
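To make the exchanged order of log and sums concrete, here is a small sketch of both scores for a pair of profile positions. It assumes a hypothetical three-letter alphabet and a made-up joint distribution `P_REL` rather than a real substitution model; the profile vectors `alpha` and `beta` are likewise invented for illustration.

```python
import math

# Hypothetical toy substitution model (an assumption, not BLOSUM data).
P_REL = [
    [0.30, 0.05, 0.05],
    [0.05, 0.20, 0.05],
    [0.05, 0.05, 0.20],
]
P = [sum(row) for row in P_REL]  # background distribution (marginals)
N = len(P)


def odds(i, j):
    """Likelihood ratio p_rel(i, j) / (p_i * p_j)."""
    return P_REL[i][j] / (P[i] * P[j])


def average_score(alpha, beta):
    """Average score: expectation of the log odds under alpha_i * beta_j."""
    return sum(alpha[i] * beta[j] * math.log(odds(i, j))
               for i in range(N) for j in range(N))


def log_average_score(alpha, beta):
    """Log average score: log of the expected odds under alpha_i * beta_j."""
    return math.log(sum(alpha[i] * beta[j] * odds(i, j)
                        for i in range(N) for j in range(N)))


# Two illustrative profile positions (made-up frequencies).
alpha = [0.7, 0.2, 0.1]
beta = [0.6, 0.3, 0.1]
print("average    :", average_score(alpha, beta))
print("log average:", log_average_score(alpha, beta))
print("background vs background:", log_average_score(P, P))
```

By Jensen's inequality the average score can never exceed the log average score, and scoring the background against itself gives a log average score of zero (up to rounding) in this toy model. With a unit vector for `beta`, `log_average_score` reduces to the evolutionary profile-sequence score.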

3 Evaluation 

In order to evaluate whether the different scores are a good measure of the relatedness 
of two profiles, we performed fold recognition and related pair recognition benchmarks. 
Additionally, we investigated how a confidence measure for the protein fold prediction 
depending on the underlying scoring system performed on the benchmark set of pro- 
teins. 

3.1 Data Set 

The experiments were carried out using a protein sequence set which consists of 1511
chains from a subset of the PDB with a maximum of 40% pairwise sequence identity
(see f5l). The composition of the test set in terms of relationships on different SCOP
levels is shown in figure 1. Throughout the experiments the SCOP version 1.50 is used.



Note that there are 34 proteins in the set which are the only representatives of their 
SCOP fold in the test set. They were deliberately left in the test set even though it is not 
possible to recognise their correct fold class because this way the results resemble the 
numbers in the application case of a query with unknown structure. 

For all sequences a structure of the same SCOP class can be found in the benchmark
set. There are 34 chains in the set without a corresponding fold representative (i.e. single
members of their fold class in the benchmark). SCOP superfamily and SCOP family
representatives can be found for 1360 and 1113 sequences of the benchmark set,
respectively.

Only chains contributing to a single domain according to the SCOP database were
used in order to allow for a one-to-one mapping of the chains to their SCOP classification.
For each chain a frequency profile representing a set of possibly homologous
sequences was constructed based on PSI-Blast searches on a non-redundant sequence
database following a procedure described in the appendix.

[Figure 1, two panels. Left panel, "Composition of the test set": number of proteins for which the test set contains a member of the same SCOP class, SCOP fold, SCOP superfamily, SCOP family. Right panel, "Composition of the test set (1511 single domain chains from PDB40)": closest relative in the test set belongs to the same SCOP superfamily, SCOP fold, SCOP class (no fold recognition).]

Fig. 1. Composition of the Test Set. Left: Number of proteins for which the test set contains
a member of the indicated SCOP level. Right: Number of proteins whose closest
relative (in terms of SCOP level) in the test set belongs to the indicated SCOP level.
This is a partition of the test set in terms of fold recognition difficulty, ranging from
SCOP family being the easiest to SCOP class being impossible.



3.2 Implementation Details 

For each examined scoring approach we then used a JAVA implementation of the Gotoh 
global alignment algorithm @ to align a query profile against each of the remaining 
1510 profiles in the test set. For a query sequence of length 150 about 6 alignments per 
second can be computed on a AOOMHz Ultra Sparc 10 workstation. 

It should be noted that for the case of fold recognition where one profile is subse- 
quently being aligned against a whole database of profiles a significant speedup can be 
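achieved by preprocessing the query profile α and calculating

$$\tilde{\alpha} := \left(\sum_{i=1}^{20} \alpha_i\,\frac{p_{\mathrm{rel}}(i,j)}{p_i\,p_j}\right)_{j=1,\dots,20}$$

thus reducing the score calculation

$$\mathrm{score}_{\mathrm{logaverage}}(\alpha,\beta) = \log\sum_{j=1}^{20} \tilde{\alpha}_j\,\beta_j$$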

to one scalar product and one logarithm. This can be done in a similar manner with the 
average scoring approach where the complexity reduces to only the scalar product. The 
running time of the algorithm could be reduced by a factor of more than 6 using this 
technique. 
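The preprocessing step can be sketched as follows. This is an illustration only, not the Java implementation used in the experiments: it assumes a hypothetical three-letter alphabet with a made-up joint distribution `P_REL`, and the function names are invented. The query position is transformed once, after which each database position costs one scalar product and one logarithm.

```python
import math

# Hypothetical toy substitution model (an assumption, not BLOSUM data).
P_REL = [
    [0.30, 0.05, 0.05],
    [0.05, 0.20, 0.05],
    [0.05, 0.05, 0.20],
]
P = [sum(row) for row in P_REL]
N = len(P)
# Odds matrix p_rel(i, j) / (p_i * p_j), computed once for the model.
ODDS = [[P_REL[i][j] / (P[i] * P[j]) for j in range(N)] for i in range(N)]


def log_average_direct(alpha, beta):
    """Log average score evaluated as a full double sum."""
    return math.log(sum(alpha[i] * beta[j] * ODDS[i][j]
                        for i in range(N) for j in range(N)))


def preprocess_query(alpha):
    """Transformed query vector: entry j is sum_i alpha_i * ODDS[i][j]."""
    return [sum(alpha[i] * ODDS[i][j] for i in range(N)) for j in range(N)]


def log_average_fast(alpha_tilde, beta):
    """Log average score as one scalar product followed by one logarithm."""
    return math.log(sum(t * b for t, b in zip(alpha_tilde, beta)))


query = [0.7, 0.2, 0.1]
database = [[0.6, 0.3, 0.1], [0.1, 0.1, 0.8], [1 / 3, 1 / 3, 1 / 3]]
query_tilde = preprocess_query(query)  # done once per query position
for beta in database:
    print(log_average_direct(query, beta), log_average_fast(query_tilde, beta))
```

Both columns of the output agree up to rounding; the gain comes from reusing `query_tilde` for every database profile position.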

3.3 Alignment Parameters 

The appropriate gap penalties were determined separately for every scoring method
using a machine learning approach (see appendix, [E3) and are shown in table 1.

Table 1. Gap Penalties Used for the Experiments.

scoring        gap open    gap extension
dot product      3.12          0.68
average          5.60          1.22
log average     10.35          0.16


Throughout the experiments shown here we used the BLOSUM 62 substitution model
[12]. The average scoring alignments were calculated using the values from the BLOSUM 62
scoring matrix and thus contain the above-mentioned scaling factor of f = 2/log 2.
To keep the results comparable we also applied the factor to the log average score.
Therefore, the gap penalties for the log average score in table 1 must be divided by f if
the score is calculated exactly as in formula (13).

3.4 Results 

For each of the three profile scoring systems discussed in section 2 the following tests
were performed using the constructed frequency profiles. In order to assess the superiority
of the profile methods over simple sequence methods we also performed the tests
for plain sequence-sequence alignment on the original chains using the BLOSUM 62
substitution matrix and the same gap penalties as for the log average scoring.



[Figure 2: bar chart of the total fold recognition rate. Legend: sequence alignment using BLOSUM 62; profile alignment using dot product scoring; profile alignment using average scoring with BLOSUM 62; profile alignment using log average scoring with BLOSUM 62.]

Fig. 2. Total Fold Recognition Performance.



Fold Recognition. The goal here is to identify the SCOP fold to which the query protein
belongs by examining the alignment scores of all 1510 alignments of the query
profile against the other profiles. The scores are sorted in a list together with the name
of the protein which produced the score, and the fold prediction is the SCOP fold of the
highest scoring protein in the list. Since all the proteins in the list are aligned against
the same query and the scores are compared, a possible systematic bias of the score by
special features of the query sequence (e.g. length dependence) is not relevant for this
test. The test was performed following a leave-one-out procedure, i.e. for each of
the 1511 proteins the fold was predicted using the list of alignments against the 1510
other profiles. The fold recognition rate is then defined as the percentage of all proteins
for which the fold prediction yielded a correct result.

[Figure 3: bar chart of fold recognition rates for the difficulty classes fold, superfamily, and family. Legend: sequence alignment using BLOSUM 62; profile alignment using dot product scoring; profile alignment using average scoring with BLOSUM 62; profile alignment using log average scoring with BLOSUM 62.]

Fig. 3. Fold Recognition Performance for Each of the Difficulty Classes.
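The leave-one-out protocol described above can be sketched in a few lines. The score matrix and fold labels below are invented toy values, and `fold_recognition_rate` is a hypothetical helper name; in the experiments the scores would come from the profile-profile alignments.

```python
def fold_recognition_rate(scores, folds):
    """Leave-one-out fold recognition rate in percent.

    scores[q][t] is the alignment score of query q against target t;
    folds[t] is the fold label of protein t. The diagonal is ignored.
    """
    n = len(folds)
    correct = 0
    for q in range(n):
        # The prediction is the fold of the highest scoring other protein.
        best = max((t for t in range(n) if t != q), key=lambda t: scores[q][t])
        if folds[best] == folds[q]:
            correct += 1
    return 100.0 * correct / n


# Toy data: five proteins, the last one is the only member of its fold
# and therefore can never be predicted correctly.
folds = ["a", "a", "b", "b", "c"]
scores = [
    [0, 9, 2, 1, 3],
    [8, 0, 1, 2, 3],
    [1, 2, 0, 7, 3],
    [5, 1, 4, 0, 2],
    [1, 2, 3, 4, 0],
]
print(fold_recognition_rate(scores, folds))
```

As in the benchmark, proteins that are the sole representative of their fold count as failures, which bounds the achievable rate below 100%.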

Out of the 1511 test sequences, log average scoring is able to assign correct folds in
1181 cases or 78.1%, whereas the usual average scoring correctly predicts 1097 (72.6%)
and dot product scoring 1024 (67.7%) sequences, both improving on simple sequence-sequence
alignment with 969 (64.1%) correct assignments. This improvement becomes
more distinctive for more difficult cases towards the twilight zone of detectable sequence
similarity. Figure 3 shows the fold recognition rates for family, superfamily, and fold
predictions separately. Here, all four methods perform well for the easiest case, family
recognition, with sequence alignment performing worst at 81.2% and log average profile
scoring performing best at 91.5%. For the hardest case of fold detection, log
average scoring (24.8%) significantly outperforms (at least 50% improvement) both
other profile methods (11.1% and 16.2%), whereas sequence alignment is hardly able
to make correct predictions (6.8%). However, the effect of performance improvement is
most marked for the superfamily level, where some remote evolutionary relationships
should, by definition, be detectable via sensitive sequence methods. Here, the new scoring
scheme again achieves an improvement of almost 50% over the second best method
(average profile scoring), increasing the recognition rate from 36.8% to 54.3%. This
more than doubles the recognition rate of simple sequence alignment (23.0%).

A more detailed look at the fold recognition results can be obtained by using confidence
measures which estimate the quality of the fold prediction a priori. Here we use
the z-score gap, which is defined as follows. First the mean and standard deviation of
the scores in the list are calculated and the raw scores are transformed into z-scores with
respect to the determined normal distribution, i.e. the following formula is applied:

$$z_{\mathrm{score}} = \frac{\mathrm{score} - \mathrm{mean}}{\text{standard deviation}}$$

Then the difference of the z-score between the top scoring protein and the next best
belonging to a SCOP fold different from the predicted one is calculated, yielding the
z-score gap. A list L which contains all 1511 fold predictions together with their z-score
gap is set up and sorted with respect to the z-score gap. Entries l ∈ L which represent
correct fold predictions are termed positives, others negatives. If i is an index in this
list, figure 4 shows the percentage of correct fold predictions if only the top i entries of
the list are predicted. It also demonstrates a clear improvement of fold prediction sensitivity
and specificity for the log average scoring as compared to the competing scoring
schemes. Again, all profile methods perform better than pure sequence alignment, but
dot product only shows a slight improvement.

Fig. 4. Fold Recognition Ranked with Respect to the z-score Gap (See Text).
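The z-score gap for a single query can be sketched as below. The scores and fold labels are invented, `z_score_gap` is a hypothetical helper name, and the use of the population standard deviation is an assumption, since the text does not say which estimator was used.

```python
import statistics


def z_score_gap(scores, folds):
    """Return the predicted fold and the z-score gap for one query.

    scores[t] is the alignment score against target t, folds[t] its fold.
    Assumes at least one target belongs to a fold other than the predicted one.
    """
    mean = statistics.mean(scores)
    # Population standard deviation; the estimator is an assumption here.
    sd = statistics.pstdev(scores)
    z = [(s - mean) / sd for s in scores]
    top = max(range(len(z)), key=lambda t: z[t])
    # Best z-score among targets from a fold different from the predicted one.
    rivals = [z[t] for t in range(len(z)) if folds[t] != folds[top]]
    return folds[top], z[top] - max(rivals)


scores = [10.0, 8.0, 4.0, 2.0]
folds = ["a", "a", "b", "c"]
predicted_fold, gap = z_score_gap(scores, folds)
print(predicted_fold, gap)
```

Sorting all queries by this gap and reading off the fraction of correct predictions among the top i entries gives a curve of the kind shown in figure 4.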
Related Pair Recognition. This protocol aims at a slightly different question. The goal
is to decide whether two proteins have the same SCOP fold by only looking at the score
of their profile alignment. Therefore, a good performance in this test means that the
scoring system is a good absolute measure of similarity between the sequences. Length
dependency and other systematic biases will decrease the performance of a scoring
system here.

The calculations done here also rely on the 1511 lists calculated in the fold recognition
setting. These are merged into one large list following two different procedures:

- z-scores: Before merging, the mean and standard deviation for each of the lists are
  calculated and the raw scores are transformed into z-scores as in the z-score formula
  above. This setting is related to the fold recognition setting since biases introduced
  by the query profile should be removed by the rescaling.

- raw scores: No transformation is applied.

The resulting list L contains in each entry l ∈ L a score score(l) and the two proteins
whose alignment produced the score. An entry l ∈ L will be called positive if the two
proteins have the same SCOP fold and negative if not. The list of 1511 · 1510 =
2,281,610 entries is then sorted with respect to the alignment score and for all scores s
in the list specificity and sensitivity are calculated from the following formulas:

\begin{align}
\mathrm{spec}(s) &= \frac{\#\{l \in L \mid l \text{ positive},\ \mathrm{score}(l) > s\}}{\#\{l \in L \mid \mathrm{score}(l) > s\}} \tag{14} \\
\mathrm{sens}(s) &= \frac{\#\{l \in L \mid l \text{ positive},\ \mathrm{score}(l) > s\}}{\#\{l \in L \mid l \text{ positive}\}} \tag{15}
\end{align}

The plots of these quantities for the whole range of score values are shown in figure 5,
which clearly exhibits the recognition performance of the new scoring scheme over the
whole range of specificities. The ranking of the respective methods is again sequence
alignment, dot product, average scoring, and log average scoring best, almost doubling
the performance of average scoring. Using z-scores, sequence alignment and dot product
scoring improve somewhat, but still, log average scoring consistently shows doubled
performance over the second best method.
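The two quantities can be computed directly from the merged list. The sketch below uses an invented list of (score, positive) entries and a hypothetical function name `spec_sens`; specificity is taken here as the fraction of positives among all entries scoring above the threshold, and sensitivity as the fraction of all positives that score above it.

```python
def spec_sens(entries, s):
    """Specificity and sensitivity at score threshold s.

    entries is a list of (score, is_positive) pairs; an entry is positive
    if the two aligned proteins share the same SCOP fold.
    """
    above = [positive for score, positive in entries if score > s]
    true_positives = sum(above)
    all_positives = sum(positive for _, positive in entries)
    # Specificity is undefined if no entry scores above the threshold.
    spec = true_positives / len(above) if above else float("nan")
    sens = true_positives / all_positives
    return spec, sens


# Toy merged list (made-up scores and labels).
entries = [(9.0, True), (7.5, True), (6.0, False), (4.0, True), (1.0, False)]
for threshold in (5.0, 0.0):
    print(threshold, spec_sens(entries, threshold))
```

Evaluating `spec_sens` for every score in the sorted list traces out a specificity-sensitivity curve of the kind plotted for the different scoring schemes.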



4 Discussion 

All experiments we performed show a clear improvement of recognition performance 
when using the introduced log average score over average scoring as well as over dot 
product scoring. The results of the fold recognition test are most interesting for the 
protein targets that fall into the superfamily difficulty class since the SCOP hierarchy
suggests here a “probable common evolutionary origin” which would make this problem
tractable to sequence homology methods such as the ones discussed here. The increase in
performance over the best previously known profile method (average scoring) achieved
by using log average scoring becomes as large as 48% (from 36.8% to 54.3%) and is
still greater on the fold level.



[Figure 5: two specificity-sensitivity plots, each with a "Sensitivity" axis.]

Fig. 5. Related Pair Recognition. Top: Specificity-sensitivity plots for the raw scores.
Bottom: Specificity-sensitivity plots for the z-scores (see text).

The pair recognition test for the raw score provides a good measure of how well the
alignment score represents a quantitative measure for the relationship between two pro- 
teins. The log average score outperforms all other methods here and the plain sequence- 
sequence alignment score even outperforms the dot product method which indicates 
that the latter approach is heavily dependent on some re-weighting procedure like the 
z-score rescaling. When performing this z-score rescaling the average scoring becomes 
significantly worse which is an unexpected effect since the objective is to make the 
scores comparable independent of the scoring method used. It is interesting that the log 
average score shows only a slight improvement here over the raw score performance 
suggesting that the raw score alone is already a good measure of similarity for the two 
profiles. 

In conclusion, we see that the proposed log average score leads to a superior per- 
formance of profile-profile alignment methods in the disciplines fold recognition and 
related pair recognition suggesting that it is a better measure for the similarity of two 
profiles than the previously described other methods tested here. This is the effect of 
simply exchanging the log and the weighted average in the definition of the average 
score. A more general fact might also be learned from this: When a scoring function 
that maps a state to a score is to be extended to a more general setting where a score is 
assigned to a distribution of states, it is not always the best way to simply take the ex- 
pected value (i. e. average scoring). Following this, future developments might include 
an incorporation of the log average scoring into a new scoring approach for protein 
threading as well as an application of the technique in the context of progressive multiple
alignment tools.

A Appendix

Two distinct sets of proteins from the PDB o are used in the described experiments.
The first one is a set introduced by urmii of 251 single domain proteins with known
structure. It is derived from a non-redundant subset of the PDB introduced by [Ql
where the sequences have no more than 25% pairwise sequence identity. From this set
all single-domain proteins with all atom coordinates available are selected, yielding the
training set S_train of 251 proteins (see also [E5J).

A.1 Adjusting Gap Costs

To provide each scoring approach with appropriate gap penalties we use the iterative
approach VALP (for Violated Inequality Minimization Approximation Linear Programming)
introduced in lEl, which is based on a machine learning approach. We use a
training set TR of 81 proteins from the data set mentioned above belonging to 11 fold
classes, each of which contains at least five of the sequences from TR. In every iteration
each of the members of TR is used as a query and aligned against all 251 protein profiles.
If we call the alignments of the best scoring fold class member for each of the 81
proteins the 81 good alignments and all the alignments of each of the 81 proteins against
a member of a different fold class bad alignments, then the iteration tries to maximise
the difference of the alignment scores between the good and the bad alignments. The
iterations were stopped when convergence could be observed, which always happened
before 16 iterations were completed.

A.2 Construction of Frequency Profiles 

For each amino acid sequence in the two sets a homology search is performed using PSI-Blast
with 10 iterations against the KIND m database of non-redundant amino acid
sequences. The resulting multiple alignment from the last iteration is restricted to the
query sequence. A frequency profile is calculated via a sequence weighting procedure
that minimises the relative entropies of the frequency vectors regarding the background
amino acid distribution [Ql. Then, a constant number of pseudo counts is added to
account for amino acids that may occur by chance at this position. This is necessary
since the goal is to end up with an estimation of the “true” amino acid distribution in
a certain position of a protein family, and it is not advisable to conclude from a finite
number of observations which failed to show a certain amino acid that it is impossible
(zero probability) to observe this amino acid in this position. Finally, all the profiles are
scaled such that the total probability for all amino acids in each position is one.


Abstract. This paper outlines an algorithm for whole genome order restriction
optical map assembly. The algorithm can run very reliably in polynomial time by 
exploiting a strict limit on the probability that two maps that appear to overlap are 
in fact unrelated (false positives). The main result of this paper is a tight bound 
on the false positive probability based on a careful model of the experimental 
errors in the maps found in practice. Using this false positive probability bound, 
we show that the probability of failure to compute the correct map can be limited 
to acceptable levels if the input map error rates satisfy certain sharply delineated 
conditions. Thus careful experimental design must be used to ensure that whole 
genome map assembly can be done quickly and reliably. 



1 Introduction 

In recent years, genome-wide shot-gun restriction mapping of several microorganisms
using optical mapping flslTll has led to high-resolution restriction maps that directly
facilitated sequence assembly avoiding gaps and compressions or validated shot-gun
sequence assembly. The simplicity and scalability of shot-gun optical mapping
suggest obvious extensions to bigger and more complex genomes, and in fact, its applications
to human and rice are underway. Furthermore, a good-quality human map
is likely to play a critical role in validating several currently available but unverified
sequences.

The key computational component of this process involves the assembly of large 
numbers of partial restriction maps with errors into an accurate restriction map of the 
complete genome. The general solution has been shown to be NP-complete, but a poly- 
nomial time solution is possible if a small fraction of false negatives (wasted data) is 
permitted. The critical component of this algorithm is an accurate bound for the false 
positive probability that two maps that appear to match are in fact unrelated. 

The map assembly and alignment problems are related to the much more widely 
studied sequence assembly and alignment problems. The primary difference in the 
problem domains is that the sequence alignment problem involves only discrete data 
in which errors can be modeled as discrete probabilities, whereas map alignment in- 
volves fragment sizing errors and hence requires continuous error models. However, 
even in the case of sequence alignment, statistical significance tests play a key role in
eliminating false positive matches and are included in many sequence alignment tools
such as BLAST (see for example chapter 2 in O).

A simple bound using Brun’s sieve can be easily derived, but such a bound often
fails to exploit the full power of optical mapping. Here, we derive a much tighter but 
more complex bound that characterizes the sharp transition from infeasible experiments 
(requiring exponential computation time) to feasible experiments (polynomial compu- 
tation time) much more accurately. Based on these bounds, a newer implementation of 
the Gentig algorithm for assembling genome-wide shot-gun maps [□ has improved its 
performance in practice. 

A close examination shows that the false positive probability bound exhibits a computational
phase transition: that is, for a poor choice of experimental parameters the probability
of obtaining a solution map is close to zero, but it improves suddenly to probability
one as the experimental parameters are improved continuously. Thus a careful, analytically
optimized choice of the experimental parameters has strong implications for experiment
design in solving the problem accurately without incurring unnecessary laboratory or
computational cost. In this paper, we explicitly delineate the interdependencies among
these parameters and explore the trade-offs in parameter space: e.g., sizing error vs. digestion
rate vs. total coverage. There are many direct applications of these bounds apart
from the alignment and assembly of maps in Gentig: comparing two related maps (e.g.
chromosomal aberrations), validating a sequence (e.g. shot-gun assembly-sequence) or
a map (e.g., a clone map) against a map, etc. Specific usage of our bounds in these
applications will appear elsewhere o.

1.1 A Sub-quadratic Time Map Assembly Algorithm: Gentig

For the sake of completeness we give a brief but general description of the basic Gentig
(GENomic conTIG) map assembly algorithm, previously described elsewhere in detail
El. Roughly, Gentig can be thought of as a greedy algorithm that in any step considers
two islands (individual maps or map contigs) and postulates the best possible way these 
two maps can be aligned. Next, it examines the overlapped region between these two 
islands and weighs the evidence in favor of the hypothesis that “these two islands are 
unrelated and the overlap is simply a chance occurrence.” If enough evidence favors this 
“false positive” hypothesis, Gentig rejects the postulated overlap. In the absence of such 
evidence, the overlap is accepted and the islands are fused into a bigger/deeper island. 
What complicates these simple ideas is that one needs a very quantitative approach 
to calculate the probabilities, the most likely alignment and the criteria for rejecting 
a false positive overlap — all of these steps depending on the models of the error pro- 
cesses governing the observations of individual single molecule maps. Ultimately, the 
Gentig algorithm can be seen to be solving a constrained optimization problem with a 
Bayesian inference algorithm to find the most likely overlaps among the maps subject 
to the constraints imposed by the acceptable false positive probability. False Positive 
constraints limit the search space, thus obviating full-scale back-tracking and avoiding 
an exponential time complexity. As a result, the Gentig algorithm is able to achieve a 
sub-quadratic time complexity. 

The Bayesian probability density estimate for a proposed placement is an approximation
of the probability density that the two distinct component maps could have been
derived from that placement while allowing for various modeled data errors: sizing
errors, missing restriction cut sites, and false optical cut sites.

The posterior conditional probability density for a hypothesized placement $\mathcal{H}$, given
the maps, consists of the product of a prior probability density for the hypothesized
placement and a conditional density of the errors in the component maps relative to
the hypothesized placement. Let the $M$ input maps to be contiged be denoted by data
vectors $D_j$ ($1 \le j \le M$) specifying the restriction site locations and enzymes. Then
the Bayesian probability density for $\mathcal{H}$, given the data, can be written using Bayes'
rule as:

$$f(\mathcal{H} \mid D_1, \ldots, D_M) \;\propto\; f(\mathcal{H}) \prod_{j=1}^{M} f(D_j \mid \mathcal{H}).$$

The conditional probability density function $f(D_j \mid \mathcal{H})$ depends on the error model used.
We model the following errors in the input data:

1. Each orientation is equally likely to be correct.

2. Each fragment size in data $D_j$ is assumed to have an independent error distributed
as a Gaussian with standard deviation $\sigma$. (It is also possible to model the standard
deviation as some polynomial of the true fragment size.)

3. Missing restriction sites in input maps $D_j$ are modeled by a probability $p_c$ of an
actual restriction site being present in the data.

4. False restriction sites in the input maps $D_j$ are modeled by a rate parameter $p_f$,
which specifies the expected false cut density in the input maps; false cuts are assumed to
be uniformly and randomly distributed over the input maps.
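The four error sources above can be illustrated with a small simulation. The following is only a sketch under stated assumptions, not the Gentig implementation: the function name `simulate_observed_map` and all of its parameters are hypothetical, the sizing variance is taken to be `sigma2 * x` for a fragment of size `x` (the variant used later in the paper, rather than a constant standard deviation), and false cuts are generated as a Poisson process with rate `p_f`.

```python
import random


def simulate_observed_map(true_cuts, length, sigma2, p_c, p_f, rng):
    """Return observed fragment sizes for one map after applying the modeled errors.

    true_cuts: true restriction-site positions (kb) inside [0, length]
    sigma2:    sizing variance per kb (assumption: variance = sigma2 * fragment size)
    p_c:       probability that a true site appears in the data
    p_f:       expected number of false cuts per kb
    rng:       a random.Random instance, so that runs are reproducible
    """
    # Missing cuts: each true site is kept independently with probability p_c.
    cuts = [c for c in true_cuts if rng.random() < p_c]

    # False cuts: uniformly scattered, generated via exponential gaps (Poisson process).
    if p_f > 0:
        pos = rng.expovariate(p_f)
        while pos < length:
            cuts.append(pos)
            pos += rng.expovariate(p_f)

    cuts.sort()
    bounds = [0.0] + cuts + [length]
    sizes = [b - a for a, b in zip(bounds, bounds[1:])]

    # Sizing error: independent Gaussian noise on every fragment, clipped at zero.
    observed = [max(0.0, rng.gauss(x, (sigma2 * x) ** 0.5)) for x in sizes]

    # Orientation: either orientation is equally likely.
    if rng.random() < 0.5:
        observed.reverse()
    return observed
```

With `p_c = 1`, `p_f = 0` and `sigma2 = 0` the function returns the true fragment sizes, possibly reversed, which is a convenient sanity check before adding noise.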

The Bayesian probability density components $f(\mathcal{H})$ and $f(D_j \mid \mathcal{H})$ are computed
separately for each contig (island) of the proposed placement, and the overall probability
density is equal to their product. For computational convenience, we actually compute
a penalty function, $\Lambda$, proportional to the logarithm of the probability density as follows:



Here $m_j$ is the number of cuts in input map $D_j$.

For fragment sizing errors, consider each fragment of the proposed contig, and let
the contig fragment be composed of overlaps from several map fragments of lengths
$x_1, \ldots, x_N$. If $p_c = 1$ and $p_f = 0$ (the ideal situation), it is easy to show that the
hypothesized fragment size $\mu$ and the penalty $\Lambda$ are:

$$\mu = \frac{1}{N}\sum_{i=1}^{N} x_i \quad \text{and} \quad \Lambda = \sum_{i=1}^{N} (x_i - \mu)^2.$$



Now consider the presence of missing cuts (restriction sites) with $p_c < 1$. To model
the multiplicative error of $p_c$ for each cut present in the contig we add a penalty
$\Lambda_c = 2\sigma^2 \log[1/p_c]$, and to model the multiplicative error of $(1 - p_c)$ for each
missing cut in the contig we add a penalty $\Lambda_n = 2\sigma^2 \log[1/(1 - p_c)]$. The alignment
computed by a dynamic programming algorithm determines which cuts are missing.

The computation of $\mu$ is modified in the case of missing cuts by assuming that
the missing cuts are located in the same relative location (as a fraction of length) as
in overlapping maps that do not have the corresponding cut missing. Finally, consider
the presence of false optical cuts when $p_f > 0$. For each false cut, we add a penalty
$\Lambda_f = 2\sigma^2 \log[1/(p_f \sqrt{2\pi}\,\sigma)]$ in order to model a “scaled” multiplicative penalty of $p_f$.
A modified penalty term is required for the end fragments of each map, which might be
partial fragments. When combining contigs of maps rather than input maps, the dynamic
programming structure is the same, except that the exact penalty
values are slightly different and are computed as the increase in penalty of the new contig
over the penalty of the two shallower contigs being combined.
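The penalty terms can be collected into a short numerical sketch. This is an illustration under assumptions, not the authors' code: the function names are hypothetical, the three cut penalties are taken as $2\sigma^2\log[1/p_c]$, $2\sigma^2\log[1/(1-p_c)]$ and $2\sigma^2\log[1/(p_f\sqrt{2\pi}\sigma)]$ as read from the text, and the total is assumed to be a plain sum of the sizing penalty and the per-cut penalties (end-fragment corrections are ignored).

```python
import math


def cut_penalties(sigma, p_c, p_f):
    """Return (present-cut, missing-cut, false-cut) penalties in units of 2*sigma^2*log."""
    scale = 2.0 * sigma ** 2
    lam_c = scale * math.log(1.0 / p_c)            # each cut present in the contig
    lam_n = scale * math.log(1.0 / (1.0 - p_c))    # each missing cut
    lam_f = scale * math.log(1.0 / (p_f * math.sqrt(2.0 * math.pi) * sigma))  # each false cut
    return lam_c, lam_n, lam_f


def alignment_penalty(n_present, n_missing, n_false, sizing_penalty, sigma, p_c, p_f):
    """Assumed additive combination of the sizing penalty and the per-cut penalties."""
    lam_c, lam_n, lam_f = cut_penalties(sigma, p_c, p_f)
    return sizing_penalty + n_present * lam_c + n_missing * lam_n + n_false * lam_f
```

Because every term is a logarithm of a probability factor, a lower total penalty corresponds to a more probable alignment, which is what lets a dynamic program compare candidate alignments by simple addition.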

The resulting alignment algorithm has a time complexity of $O(m_i^2 m_j^2)$ in the worst
case, but an average case complexity of $O(m_i + m_j)$, achieved with several simple
heuristics. The basic dynamic programming is combined with a global search that tries
all possible pairs of the $M$ input maps for possible overlaps. A sophisticated
implementation in Gentig achieves an average case time complexity of $O(m M^{1+\epsilon})$
($\epsilon = 0.40$ is typical for the errors we encounter), where $m$ is the average value of
$m_j$. It relies on several heuristics based on “geometric hashing” while avoiding any
backtracking.



1.2 Summary of the New Results 

Before proceeding further with the technical details of our probabilistic analysis, we
summarize the two main formulae that can be used directly in estimating the false
positive probability for a particular map alignment, or in designing a biochemical
experiment with the goal of bounding the false positive probability below some acceptable
small value (typically $\le 10^{-3}$).



The Formula for False Positive Probability. Consider a population of $M$ ordered
restriction maps with errors of the kind described earlier. Assume that the best matching
pair of maps (under a Bayesian formulation) has $n$ aligned cuts and $r$ misaligned cuts,
and that $R$ is some average of the relative sizing error of aligned fragments in the overlap.
Then $FPT_r$ denotes the probability that the two maps are unrelated and the detected
overlap is purely by chance.



FPTr < 4 




/2n -I- r -I- 2\ jcr 



where Pn 




Note that if $r = 0$ (implying that the best match has all the cuts aligned and the only
error source is sizing error), then $FPT_0 = 4\binom{M}{2}P_n$. If $R \ll 1$ then as $n$ gets larger
$FPT_0$ exhibits an exponential decay to 0, and this property remains true for non-zero
values of $r$.

The Formula for Feasible Genome-Wide Shotgun Optical Mapping. Consider an
optical mapping experiment for genome-wide shotgun mapping for a genome of size $G$
and involving $M$ molecules each of length $L_d$. Thus the coverage is $M L_d / G$. Let
a fragment of true size $X$ have a measured size $\sim \mathcal{N}(X, \sigma^2 X)$. Let the average true
fragment size be $L$, and the digestion rate of the restriction enzyme be $P_d$. Thus the
average relative sizing error is $R = \sigma \sqrt{P_d / L}$ and the average size of aligned fragments
will be $L / P_d^2$. As usual, let $\theta$ represent the minimum “overlap threshold.” Hence the
expected number of aligned fragments in a valid overlap is at least $n = \theta L_d P_d^2 / L$. Let
$d = 1/P_d$, the inverse of the digest rate. Feasible experimental parameters are those
that result in an acceptable (e.g., $\le 10^{-3}$) False Positive rate FPT:



FPT 



2M^ 



\2nd + 2] 
[2n{d — 1)J 




2(d-l)nR 



To achieve an acceptable false positive rate, one needs to choose acceptable values for
the experimental parameters: $P_d$, $\sigma$, $L_d$ and coverage. FPT exhibits a sharp phase
transition in the space of experimental parameters. Thus the success of a mapping project
depends critically on a prudent combination of experimental errors (digestion
rate, sizing), sample size (molecule length and number of molecules) and problem size
(genome length). Relative sizing error can be lowered simply by increasing $L$ through the
choice of a rarer-cutting enzyme, and the digestion rate can be improved by better chemistry.

As an example, for a human genome of size $G = 3{,}300$ Mb and a desired coverage
of $6\times$, consider the following experiment. Assume a typical molecule length of
$L_d = 2$ Mb. If the enzyme of choice is PAC I, the average true fragment length is about
25 kb. Assume a minimum overlap¹ of $\theta = 30\%$. Assume that the sizing error for a
fragment of 30 kb is about 3.0 kb, and hence $\sigma^2 = 0.3$ kb. With a digest rate of $P_d = 82\%$
we get an unacceptable FPT $\approx 0.0362$. However, just increasing $P_d$ to 86% results in
an acceptable FPT $\approx 0.0009$. Alternatively, reducing the average sizing error from 3.0 kb to
2.4 kb while keeping $P_d = 82\%$ also produces an acceptable FPT $\approx 0.0007$.
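The derived quantities that feed the FPT formula in this example can be computed directly. The sketch below is only an illustration: the function name is hypothetical, the relations $M = \text{coverage} \cdot G / L_d$, $d = 1/P_d$, $n = \theta L_d P_d^2 / L$ and $R = \sigma\sqrt{P_d/L}$ are used as read from the text, and an average fragment length of 25 kb is an assumption about the garbled value in the source. It does not evaluate FPT itself.

```python
import math


def derived_parameters(genome_kb, coverage, mol_len_kb, frag_len_kb, theta, sigma2_kb, p_d):
    """Return (M, d, n, R) for a shotgun optical mapping experiment (all lengths in kb)."""
    n_molecules = coverage * genome_kb / mol_len_kb           # M: number of molecules
    d = 1.0 / p_d                                             # inverse digest rate
    n_aligned = theta * mol_len_kb * p_d ** 2 / frag_len_kb   # expected aligned fragments
    rel_error = math.sqrt(sigma2_kb * p_d / frag_len_kb)      # average relative sizing error
    return n_molecules, d, n_aligned, rel_error


# Human genome example: G = 3,300 Mb, 6x coverage, 2 Mb molecules, theta = 30%.
for p_d in (0.82, 0.86):
    M, d, n, R = derived_parameters(3.3e6, 6, 2000.0, 25.0, 0.30, 0.3, p_d)
    print(f"P_d={p_d:.2f}  M={M:.0f}  d={d:.3f}  n={n:.2f}  R={R:.4f}")
```

Under these assumptions, raising the digest rate from 82% to 86% increases the expected number of aligned fragments from about 16 to about 18, which illustrates why a small change in $P_d$ can move an experiment across the phase transition.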

Obviously one should allow some margin in choosing experimental parameters so
that the actual experimental parameters will be a reasonable distance from the phase
transition boundary. This margin is needed to allow both for some slippage in experimental
errors and for the possibility that there may be additional small errors not captured by the
error model.



2 A Technical Probabilistic Lemma 

The key to understanding the false positive bound is the following technical lemma that
forms the basis of further computation. Let $X = (x_1, \ldots, x_n)$ and $Y = (y_1, \ldots, y_n)$
be a pair of sequences of positive real numbers, each sequence representing sizes of
an ordered sequence of restriction fragments. We rely on a “matching rule” to decide
whether $X$ and $Y$ represent the same restriction fragments in a genome, by comparing

¹ This value should be selected to minimize FPT.
the individual component fragments. We proceed by computing a “weighted squared
relative sizing error” that is then compared to a specific threshold $\Theta$. The “weighted
squared relative sizing error” is simply

$$\sum_{i=1}^{n} w_i \left( \frac{x_i - y_i}{x_i + y_i} \right)^2,$$

where the $w_i$'s are chosen to match the error model. For example, if the sizing error variance
for a fragment with true size $X$ is $\sigma^2 X^p$, where $p \in [0, 2]$, we can use $w_i =$

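Lemma 1. Let $X = (X_1, \ldots, X_n)$ and $Y = (Y_1, \ldots, Y_n)$ be a pair of sequences of
IID random variables $X_i$'s and $Y_i$'s with exponential distributions and pdf's
$f(x) = \frac{1}{L} e^{-x/L}$. Then

1. $\Pr\left(|X_i - Y_i|/(X_i + Y_i) \le \Theta\right) \le \Theta$, for all $\Theta \ge 0$, with equality holding if
$\Theta \le 1$.

2. $\Pr\left(\sum_{i=1}^{n} w_i \left(\frac{X_i - Y_i}{X_i + Y_i}\right)^2 \le \Theta\right) \le \frac{(\frac{\pi}{4}\Theta)^{n/2}}{(\frac{n}{2})! \prod_{i=1}^{n} \sqrt{w_i}}$, for all $\Theta \ge 0$, with equality
holding if $\Theta \le \min_{1 \le i \le n} w_i$.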



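Proof. The first identity can be shown by integrating the relevant portion of the joint
distribution of $X_i$ and $Y_i$. For $0 \le \Theta < 1$, using the symmetry between $X_i$ and $Y_i$:

$$\Pr\left( \frac{|X_i - Y_i|}{X_i + Y_i} \le \Theta \right) = 2 \int_{X_i=0}^{\infty} \int_{Y_i=X_i}^{X_i \frac{1+\Theta}{1-\Theta}} \frac{1}{L^2}\, e^{-(X_i + Y_i)/L}\, dY_i\, dX_i = \Theta.$$

Note that this means that for each pair of random fragment sizes $X_i$, $Y_i$ the statistic
$U_i = |X_i - Y_i|/(X_i + Y_i)$ is uniformly distributed between 0 and 1.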
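The uniformity of this statistic is easy to check numerically. The following Monte Carlo sketch (function names are illustrative, not from the paper) draws pairs of independent exponential fragment sizes and compares the empirical distribution of $|X - Y|/(X + Y)$ with the uniform distribution on [0, 1].

```python
import random


def relative_difference_samples(n_samples, mean_len, rng):
    """Sample |X - Y| / (X + Y) for IID exponential X, Y with the given mean."""
    samples = []
    for _ in range(n_samples):
        x = rng.expovariate(1.0 / mean_len)
        y = rng.expovariate(1.0 / mean_len)
        samples.append(abs(x - y) / (x + y))
    return samples


def empirical_cdf(samples, theta):
    """Fraction of samples that are <= theta."""
    return sum(1 for u in samples if u <= theta) / len(samples)


rng = random.Random(1)
samples = relative_difference_samples(20000, 25.0, rng)
for theta in (0.1, 0.3, 0.5, 0.9):
    # For a uniform statistic the empirical CDF should be close to theta itself.
    print(theta, round(empirical_cdf(samples, theta), 3))
```

The result does not depend on the mean fragment length, which is consistent with the fact that the statistic is a ratio and therefore scale-free.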

We can now compute the overall probability for all $n$ fragment pairs:

$$P_n = \Pr\left( \sum_{i=1}^{n} w_i \left( \frac{X_i - Y_i}{X_i + Y_i} \right)^2 \le \Theta \right) = \Pr\left( \sum_{i=1}^{n} w_i U_i^2 \le \Theta \right).$$

Note that $U_1, \ldots, U_n$ are IID uniform random variables over [0,1], hence this probability is
just that part of the volume of the $n$-dimensional unit cube that satisfies the condition
$\sum_{i=1}^{n} w_i U_i^2 \le \Theta$. For small sizing errors such that $\Theta \le \min(w_1, \ldots, w_n)$, this region
is one orthant of an $n$-dimensional ellipsoid with radius $\sqrt{\Theta/w_i}$ in the $i$th
dimension. In general this volume is an upper bound and hence:



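$$P_n \le \frac{\left( \frac{\pi}{4} \Theta \right)^{n/2}}{\left( \frac{n}{2} \right)! \prod_{i=1}^{n} \sqrt{w_i}}.$$

Here $n!$ is defined in terms of the Gamma function for fractional $n$: $n! = \Gamma(n + 1)$.

QED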
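As a numerical illustration of the lemma, the bound can be evaluated as the volume of one orthant of an ellipsoid with semi-axes $\sqrt{\Theta/w_i}$, that is $(\pi\Theta/4)^{n/2} / ((n/2)!\,\prod_i \sqrt{w_i})$ with the factorial taken as a Gamma function. The function name below is hypothetical and the final clipping at 1 is an addition made here because a probability cannot exceed one.

```python
import math


def pn_upper_bound(theta, weights):
    """Orthant-of-ellipsoid volume bounding Pr(sum_i w_i * U_i**2 <= theta)."""
    n = len(weights)
    volume = (math.pi * theta / 4.0) ** (n / 2.0)
    volume /= math.gamma(n / 2.0 + 1.0)       # (n/2)! = Gamma(n/2 + 1)
    for w in weights:
        volume /= math.sqrt(w)
    return min(1.0, volume)


# n = 1: Pr(U**2 <= 0.25) = 0.5 exactly, and the bound reproduces it.
print(pn_upper_bound(0.25, [1.0]))
# n = 2 with unit weights: a quarter disk of radius sqrt(theta), area pi * theta / 4.
print(pn_upper_bound(0.5, [1.0, 1.0]))
```

For $n = 1$ and $n = 2$ the bound coincides with the exact probability, in line with the equality condition $\Theta \le \min_i w_i$.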

Lemma 2. Let $X = (X_1, \ldots, X_n)$ and $Y = (Y_1, \ldots, Y_n)$ be a pair of sequences
such that the variables $X_i$'s and $Y_i$'s are given in terms of IID random variables $Z_j$'s with
exponential distributions and pdf's $f(z) = \frac{1}{L} e^{-z/L}$. In particular, for $i = 1, \ldots, n$, if

we can express Xi and Yi in terms of exponential IID random variables Z\, . . Z^, 
Zn+i, ■ . ; Zr^^si as follows: 

^ Si 

^ k^l 

f'i ^ Si 

iYi3,x.(^Xi, Yf) = ^ ^ Z}^ + — ^ ^ 

k=l k=l 



Then 



1. Pr(|Xi - 

2. Pr(Er=i 



Y|/(X, + Yi) <e)< {‘*+^*-^)0^fforall&> 0. 

'^YXi+Yil (E?=in/2)! 



, for all 0 > 0. 



Proof. Similar to the previous lemma. QED 



3 Model of Random Maps 

Our model of random maps is that cut sites are randomly and uniformly distributed, so
that the distance between cut sites is a random variable $X$ with an exponential
distribution and probability density $f(x) = \frac{1}{L} e^{-x/L}$, where $L$ is the average distance
between cut sites. Here we assume that all cut sites are indistinguishable from each other.

First we consider the case with no misaligned cuts, so that the only errors in the
proposed overlap region are sizing errors. Thus our alignment data consists of two maps
with fragment sizes $x_1, \ldots, x_n$ on one map that align with fragment sizes $y_1, \ldots, y_n$
on the other map, where $n$ is the number of fragments in the overlap region. Here the
quality of the alignment will be measured by a weighted squared relative sizing error,
$E = \sum_{i=1}^{n} w_i (x_i - y_i)^2 / (x_i + y_i)^2$, where the $w_i$ are chosen as explained earlier. We
need to compute $P_n = \Pr(\mathcal{E} \le E)$, where $\mathcal{E} = \sum_{i=1}^{n} w_i (X_i - Y_i)^2 / (X_i + Y_i)^2$ is the
same statistic computed for random maps. By an application of the previous lemma, we have:

$$P_n \le \frac{\left( \frac{\pi}{4} E \right)^{n/2}}{\left( \frac{n}{2} \right)! \prod_{i=1}^{n} \sqrt{w_i}}. \qquad (1)$$

Here, $n! = \Gamma(n + 1)$. For current purposes it suffices to note that $(\tfrac{1}{2})! = \sqrt{\pi}/2$. For
example, $(\tfrac{3}{2})! = \tfrac{3}{2} (\tfrac{1}{2})! = 3\sqrt{\pi}/4$.

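To see more clearly how this probability scales with the sizing errors, let us define
the weighted RMS relative sizing error $R_n$, and the average weight $A_n$:

$$R_n^2 = \frac{\sum_{i=1}^{n} w_i \left( \frac{x_i - y_i}{x_i + y_i} \right)^2}{\sum_{i=1}^{n} w_i}, \qquad A_n = \frac{1}{n} \sum_{i=1}^{n} w_i, \qquad (2)$$

so that $E = n A_n R_n^2$. Then we can rewrite $P_n$ using Stirling's expression for factorials as:

$$P_n \le \frac{\left( R_n \sqrt{\pi e A_n / 2} \right)^n}{\sqrt{\pi n}\, \prod_{i=1}^{n} \sqrt{w_i}}. \qquad (3)$$

This shows that asymptotically the $n$-fragment false positive probability will
decrease with the $n$th power of the RMS relative error $R_n$ provided that
$R_n < \sqrt{2/(e\pi)} = 0.4839$.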
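The decay threshold and the rate of decay can be checked numerically. The sketch below assumes unit weights ($w_i = 1$, so that the average and geometric mean of the weights are both 1), in which case applying Stirling's approximation to the bound of equation (1) gives $(R_n\sqrt{e\pi/2})^n/\sqrt{\pi n}$; the function names are illustrative only.

```python
import math


def decay_threshold():
    """Largest RMS relative error for which the Stirling-form bound decays with n."""
    return math.sqrt(2.0 / (math.e * math.pi))


def pn_stirling_bound(r_n, n):
    """Stirling-form bound on P_n for unit weights (an assumption of this sketch)."""
    return (r_n * math.sqrt(math.e * math.pi / 2.0)) ** n / math.sqrt(math.pi * n)


print(round(decay_threshold(), 4))
for n in (4, 8, 16, 32):
    # With a 10% relative sizing error the bound falls off geometrically in n.
    print(n, pn_stirling_bound(0.10, n))
```

The geometric decay in $n$ is the reason the number of aligned fragments is the most sensitive quantity in the experiment design discussed later.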

To complete our computation of the False Positive Likelihood $FP$ for a particular
pair of maps $D_1$ and $D_2$, we need to consider the multiple possible choices of overlaps
of $n$ or more fragments. Let the two molecules contain $N_1$ and $N_2$ fragments with
$N_1 \le N_2$. If $n < N_1$ there will be exactly 4 possible ways of forming an overlap of
$n$ fragments. Otherwise, if $n = N_1$ there will be $2(N_2 - N_1 + 1)$ ways of forming
an $n$-fragment overlap. Each such overlap has the same independent probability $P_n$.
Thus with 4 possible overlaps, the probability $FP_n$ of finding at least 1 overlap of $n$
fragments between two random maps as good as the actual alignment is bounded by
$FP_n \le 1 - (1 - P_n)^4 \le 4P_n$.

We also need to consider random overlaps of more than $n$ fragments that are as
good as the actual overlap. Under a typical Bayesian error model, each overlap of more
than $n$ fragments can have slightly larger sizing errors than the
actual alignment with the same probability density, since the prior probability density
must be biased towards larger overlaps. For such an error model one can show
that the permissible increase in relative sizing error $R_{n+1}$ vs. $R_n$ is given approximately
by:



n+l-^n+1 



< 



nAnRri^ + K j2 
(n + 1) 



where $K$ is a prior bias parameter, typically in the range $1 \le K \le 1.4$. Hence for
$n + k < N_1$ we can write $FP_{n+k}$ as:



FP, 



n-\-k — 



<4P„ 



-Rn 



irAne^/^^-^- 

2Gn 



k/2 



Here $G_n$ is the geometric mean of $w_i$, $i = 1, \ldots, n$. If $n + k > N_1$ then $FP_{n+k} = 0$
and if $n + k = N_1$ we just need to replace the factor 4 by $2(N_2 - N_1 + 1)$.

We can now compute $FP$ by combining overlaps of all possible numbers of fragments
(ranging over $n, \ldots, N_1$):



Ni —n 

FP< ^ FPn+k 

k^O 

= 2Pn(2/{l-Z)+ (N2-Ni- 



1 + ^ 
1 - z 



I TrAnC^GA-nRn 



2Gn 




where,Z = P„ 

This result applies to the case of two maps. The generalization to a population of many
maps is considered for the more general case of missing cuts in the next section.

4 False Positive Probability with Missing Cuts 

When misaligned cuts are present in the actual alignment, the false positive probability
becomes larger. Assuming the maps are random, we have many possible alignments for
a given overlap region, greatly increasing the odds of coming up with a good alignment.

In this case our actual alignment data consists of $n$ pairs of fragment sizes $x_1, x_2, \ldots, x_n$
and $y_1, y_2, \ldots, y_n$ as before, plus the total number of fragments $m$ in the overlap
region of the two maps, where $m = 2n + r$ and $r \ge 0$ is the number of misaligned cuts.

First consider the case where the number of misaligned cuts is fixed at $r = m - 2n$
and the number of aligned fragments is fixed at $n$ in each map. We define the
probability $P_{n,m}$ as the probability that two random maps with an overlap region of
exactly $m$ total fragments in both maps could produce an alignment of $n$ fragments as
good as the actual alignment.

The key to computing Pn,m is a systematic way to enumerate possible alignments 
that can be applied to each sample of the two hypothesized random maps, then compute 
the probabilities that a particular enumerated alignment will have better sizing error than 
the actual alignment, and combine these probabilities over all enumerated alignments. 

It simplifies matters if we consider random alignments between the left ends of two
random maps and compute the probability of finding an alignment involving the first $m$
fragments from the left (on either of the two random maps) that is as good as the actual
overlap.

We now claim that all possible alignments between the left ends of two random maps
involving $m$ fragments can be enumerated, independent of the random map sample, as
follows:

1. Pick any $n + 1$ numbers $s_0, s_1, \ldots, s_n$ and another $n + 1$ numbers $r_0, r_1, \ldots, r_n$
subject only to the constraints $s_i \ge 1$, $r_i \ge 1$, $\sum_{i=0}^{n} (s_i + r_i) \le m + 2$.

2. Align the two random map samples so that their left ends coincide. Then scan
both maps from the left end and pick the $s_0$th cut site encountered on either map.
Then scan further to the right on the other map until another $r_0$ cut sites have been
encountered and align the $r_0$th one with the previous cut site. This defines the first
aligned pair of cuts. The maps are now re-aligned so that this pair of cuts coincides
(rather than the left ends of the maps).

3. Repeat this step for $i = 1, \ldots, n$: Starting from the previous aligned cut site, scan
to the right on both maps until $s_i$ sites have been seen and mark the last one, which
could be on either map. Then scan to the right from that cut site on the opposite
map only until $r_i$ cut sites have been seen and align the last ($r_i$th) one with the
previously marked cut site. This defines the $i$th set of aligned fragments. Realign
the maps so that this pair of cut sites coincides.

4. After aligning the last ($(n + 1)$th) cut site, scan right on both maps until a total of
$m + 2$ cut sites have been seen (including all cut sites seen in previous steps). Mark
the boundary of the aligned region anywhere between the last seen cut site and the
next one.

For each enumerated alignment defined by a particular choice of $s_0, \ldots, s_n$ and $r_0, \ldots, r_n$
we can compute the probability $PA_{n,m,s,r}$ that for two random maps that particular
enumerated alignment will have relative sizing errors better than the actual map
alignment. We can then compute an upper bound for $P_{n,m}$ as the sum of $PA_{n,m,s,r}$ over all
enumerated alignments.

First we compute the probability of no overhang as a function of $r_0$ and $s_0$.

An overhang occurs if the sum of rg random fragments add up to more than the left- 
most fragment in the same molecule. Using IlD random variables Si, ...,6r each drawn 
from to represent the r intervals, and Z\, ..., Zg, also drawn from , 

to represent twice the s intervals used to select the first aligned cut, we can obtain by 
suitable integration: 



Er.g — 



1 (r + s — 2 



3^-1 I 3 s_i 1 ^ 



s-l 

E' 

fe=i 



1 fr + k — 2 



3^= V k-l 



Using part 2 of the earlier Lemma 2, we can write $PA_{n,m,s,r}$ as follows:



P Ayi rn^s,r — Ej-^^sq^^ f ^ ^ 

\i=l 

n 

where E = 

i=l 

and simplify it to 



X,-Y, 
X^ + Y, 



Xt - Vi 
Xi + Vi 



< E 



— xiAjiRji , 



PAn,m,s,r < Er„,soPn{‘2eRn^ 



,n + S 

where N = max Vi, and S = (n — 1). 



(n-|-S-|-l)/2 n ^Si-l-n-1^ 



/ 1\| r-i-l 

= 1 W* 2 



i=l 



Here $P_n$ is the result without misaligned cuts.

We will now sum up $PA_{n,m,s,r}$ over all possible choices of $s_0, \ldots, s_n$ while keeping
$r_0, \ldots, r_n$ fixed, subject to the constraint $\sum_{i=0}^{n} (s_i + r_i) \le m + 2$, to produce:



P Aji^m.r — ^ ^ P A>Yi^r 



, fm+l-ro 

“ 2'-o V V 2n + S 



xP„(2ei?„2A„)^/2 



2?T, “h 1 “h tSy 

(n+S+l)/2 n 



n 



n-\- S 



(f)! 



n (i)l^,(n-i)/2 



Next, we will add up $PA_{n,m,r}$ over all possible choices of $r_0$ then $r_1, \ldots, r_n$, where $r_0$
is constrained by $1 \le r_0 \le m - 2n - S + 1$ and $r_1, \ldots, r_n$ by $S = \sum_{i=1}^{n} (r_i - 1) \le
m - 2n$. Approximating $w_i$ by its geometric mean $G_n$ where needed we get:



ri-.rn ro 

< ^ Pr,{2eRn^Anf^^ 



(n+S+l)/2 / . -| 

/ m + 1 



n 



(f)! 



2n+l + ^y f \ (i)!tt;,G^-i)/2 






The resulting expression diverges for large values of $r$ but the bound is quite tight for
realistic values of $m$, $n$, $R_n$.

As a final step in computing the False Positive probability we need to combine the
$P_{n,m}$ just computed over random alignments involving fewer misaligned cuts (smaller
values of $r$) or more aligned fragments (larger $n$), as well as consider the possible ways
the ends of two random maps could be aligned with each other. Using the same approach
as for the case without misaligned cuts to model the permissible change in sizing error,
we can show that the result is:



FPr < 4P„ 



2n + r + 2^ 






Rr, 



2A„ 



/ f2n+r+l 
I I V r+l-j 






^2n+r+l^ 




where, 



Z = R„ 



I irAne^G^nRA /2n + r + 3 



2Gn 



2n + 3 



5 False Positive Probability: Population of Maps 



Finally, if there are multiple maps to choose from, we need to reject the possibility that
the proposed map pair is merely the best matching amongst all possible map pairs. We
need not consider maps with fewer than $n + r/2$ fragments. Let the number of maps with
at least $n + r/2$ fragments be $M$, with numbers of fragments $N_i$, $i = 1, \ldots, M$, arranged
in ascending order. For each of the possible $M(M - 1)/2$ map pairs we can compute
the probability $FP_r$ just described, but with $N_1$, $N_2$ suitably adjusted. The resulting
probability $FPT_r$ is given by:



$$FPT_r \le \sum_{i=1}^{M-1} \sum_{j=i+1}^{M} FP_r(N_i, N_j)$$





' /2n+r+l\ 
V ) 

^2n+r+l^ 



n 
M-l M









2=1 ji = 2+l 



where. 



Z = R„ 



1 / 2n + r + 3 



2Gr, 



2tz -t“ 3 




( 4 ) 



(5) 



This is to be used with the previous equations for $P_n$, $R_n$, $A_n$, $G_n$ and the error model
parameter $K$ (and implicitly $\sigma$).



6 Experiment Design 



In designing a shot-gun genome-wide mapping experiment, one needs to ensure that the
data allows correct map overlaps to be clearly distinguished from random map overlaps.
If this is done using a False Positive threshold such as the FPT we have derived in
this paper, the goal is to ensure that the expected FPT for correct map overlaps does
not exceed some acceptable threshold (e.g. $10^{-3}$). In this section we will estimate the
expected value of FPT for a valid overlap based on the experimental error parameters.

In principle we just need to estimate the values of $n$, $r$, $R_n$, $M$ for a correct overlap
based on the experimental errors. However, given the extreme sensitivity of FPT to
$n$, the number of aligned fragments, we will compute FPT for correct map overlaps
of a certain minimum size. By selecting a suitable value of $\theta$, the minimum “overlap
value”, we can control the expected minimum value of $n$, at the cost of some reduction
in effective coverage by the factor $1 - \theta$.

In addition to $\theta$ assume the following experimental parameters: $G$ = expected
genome size. $L_d$ = length of each map. $C$ = desired coverage (before adjustment
for $\theta$). $L$ = average distance between restriction sites in the genome. $\sigma\sqrt{X}$ = sizing error
(standard deviation) for a fragment of size $X$. $P_d$ = the digestion rate of the restriction
enzyme used.

Assuming i? 1 and An « we can then write FPT in terms of the experi- 
mental parameters as: 



FPT 



\2nd+2\ \ {R^r 

V[2n(d-l)J/ 



(^l + {d-l)R 




( 6 ) 



where we have d = 4-,n = -r^, R = , and M = 

Pd' LPP' Ld 



7 Simulated Experimental Results

This section provides some simulated experiments of our algorithm, to illustrate the
effect of different experimental parameters. To generate simulated test sets of the map
data, we begin with a 30 Mb in silico map of chromosome 22 using the PAC I restriction
enzyme, sample random pieces of this map averaging 2 Mb in size, and introduce
the following errors into each such map:

- Each fragment of actual size $L$ is re-sampled from a Gaussian distribution with
mean $L$ and variance $\sigma^2 L$ with $\sigma^2 = 0.3$ kb.

- Each restriction site is retained with an independently simulated digest rate of $P_d$.
We ran experiments with $P_d$ = 0.5, 0.6, 0.7, 0.71, 0.72, 0.73, 0.74, 0.75, 0.8 and
0.9.

- False cuts are introduced at a uniform rate of $R_f = 1$ per Mb.

- Missing small fragments are simulated at a rate of $0.7^L$ for an $L$ Kb fragment.

- The orientation of each molecule is randomly flipped with probability of 0.5.

Enough maps were generated in each experiment to produce a gross coverage of 16x 
(240 maps for the 30 Mb chromosome). Each simulated data set was then run on a pro- 
gram implementing the algorithms described in this paper, using false positive thresh- 
olds of 0.00001. 



Digest-Rate   # contigs   Size of the largest contig   # singletons
                          3994 Kb
                          11588 Kb
              3           14385 Kb                     88
0.73                      30119 Kb                     53
0.75          1           30144 Kb                     47
0.80          1           30075 Kb                     17
0.90          1           30000 Kb                     4



The contigs generated were checked and verified to be at the correct offset (though 
minor local alignment errors may be present). 



8 Conclusion 

In this paper we derived a tight False Positive Probability bound for overlapping two
maps. This can be used in the assembly of genome-wide maps to reduce the search space
from exponential time to sub-quadratic time with only a small increase in false
negatives. The False Positive Probability bound can also be used to determine if a
sequence-derived map has a statistically significant match with a map.

We also showed how the False Positive Probability bound can be used to select ex- 
perimental parameters for whole-genome shot-gun mapping that will allow the genome 
wide map to be assembled rapidly and reliably and showed that the boundary between 
feasible and infeasible experimental parameters is quite narrow, exhibiting a form of 
computational phase transition. 

Our approach has certain limitations due to the assumption underlying our model
that unrelated parts of the genome will not align with each other except in a random
manner. This assumption is not true for haplotypes, for example, and hence our
algorithm is not sufficient to produce a haplotyped map. Similarly, if some other biological
process results in strong homologies over large distances, our algorithm may merge
homologous regions of the genome. If this turns out to be a problem, explicit postprocessing
of the resulting map contigs to look for merged homologous regions or haplotypes can
be performed.

Abstract. Radiation hybrid (RH) mapping is a somatic cell technique that is used
for ordering markers along a chromosome and estimating physical distances between
them. It nicely complements the genetic mapping technique, allowing for
finer resolution. Like genetic mapping, RH mapping consists in finding a marker
ordering that maximizes a given criterion. Several software packages have been
recently proposed to solve RH mapping problems. Each package offers specific
criteria and specific ordering techniques. The most general packages look for
maximum likelihood maps and may cope with errors, unknowns and polyploid
hybrids at the cost of limited computational efficiency. More efficient packages
look for minimum breaks or two-point approximated maximum likelihood maps
but ignore errors, unknowns and polyploid hybrids.

In this paper, we present a simple improvement of the EM algorithm that
makes maximum likelihood estimation much more efficient (in practice and to
some extent in theory too). The boosted EM algorithm can deal with unknowns
in both error-free haploid data and error-free backcross data. Unknowns are usually
quite limited in RH mapping but cannot be ignored when one deals with
genetic data or multiple populations/panels consensus mapping (markers being
not necessarily typed in all panels/populations). These improved EM algorithms
have been implemented in the CarthaGene software. We conclude with a
comparison with similar packages (RHMAP and MapMaker) using simulated data
sets and present preliminary results on mixed simultaneous RH/genetic mapping
on pig data.



1 Introduction 

Radiation hybrid mapping is a somatic cell technique that is used for ordering markers
along a chromosome and estimating the physical distances between them. It nicely
complements alternative mapping techniques, especially by providing intermediate
resolutions. This technique has been mainly applied to human cells but has also been used
on animals.

The biological experiment in RH mapping can be rapidly sketched as follows: cells 
from the organism under study are irradiated. The radiation breaks the chromosomes 
at random locations into separate fragments. A random subset of the fragments is then 
rescued by fusing the irradiated cells with normal rodent cells, a process that produces 
a collection of hybrid cells. The resulting clone may contain none, one or many chro- 
mosome fragments. This clone is then tested for the presence or absence of each of the 



O. Gascuel and B.M.E. Moret (Eds.): WABI 2001, LNCS 2149, pp. 41-0] 2001.
© Springer-Verlag Berlin Heidelberg 2001






Thomas Schiex et al. 



markers. This process is performed a large number of times producing a radiated hybrid 
panel. 

The algorithmic analysis which follows the biological experiment, based on the
retention patterns of the markers observed, aims at finding the most likely linear ordering
of the markers on the chromosome along with distances between markers. The
underlying intuition is that the further apart two markers are, the more likely it is that the
radiation will create one or more breaks between them, placing the two markers on
separate chromosomal fragments. Therefore, close markers will tend to be more often
co-retained than distant ones. Given an order of the markers, the retention pattern can
therefore be used to estimate the pairwise physical distances between them.
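The intuition that close markers are co-retained more often than distant ones can be made concrete with a toy calculation. The sketch below is an assumption-laden illustration, not the model used in the paper's software: it assumes a haploid equal-retention model in which a break occurs between two markers with probability $1 - e^{-d}$ for a distance $d$ (in Rays), each fragment is retained independently with probability `retention`, and there are no typing errors. All function names are hypothetical.

```python
import math


def break_probability(distance_rays):
    """Assumed probability of at least one radiation break between two markers."""
    return 1.0 - math.exp(-distance_rays)


def coretention_probability(distance_rays, retention):
    """Probability that both markers are retained in a hybrid under the assumed model."""
    theta = break_probability(distance_rays)
    # No break: the markers share a fragment, retained with probability `retention`.
    # Break: the markers sit on separate fragments that are retained independently.
    return retention * (1.0 - theta) + retention ** 2 * theta


for d in (0.0, 0.1, 0.5, 1.0, 5.0):
    print(d, round(coretention_probability(d, 0.3), 4))
```

Under these assumptions the co-retention probability falls monotonically from `retention` for adjacent markers towards `retention ** 2` for very distant ones, which is the signal that a maximum likelihood method inverts to estimate distances.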

Two fundamental types of approaches have been used to evaluate the quality of
a possible permutation of the markers. The first, crudest approach is a non-parametric
approach, called “obligate chromosomal breaks” (OCB), that aims at finding a
permutation that minimizes the number of breaks needed to explain the retention pattern.
This approach is not considered in this paper. The second one is a statistical parametric
method of maximum likelihood estimation (MLE) using a probabilistic model of
the RH biological mechanism. Several probabilistic models have been proposed for RH
mapping [9], dealing with polyploidy, errors and unknowns. In this paper, we are only
interested in a subset of the models that are compatible with the use of the EM
algorithm for estimating distances between markers. In our experience, the
simplest “equal retention model” is the most frequently used in practice and also the
most widely available because it is a good compromise between efficiency and realism.
Such models are used in the RHMAP and RHMAPPER packages. More recently, more
efficient approximated MLE versions based on two-point estimation have also been
used, but they do not deal with unknowns and will not be considered in the sequel.

The older but still widely used genetic mapping technique exploits the occurrence of
cross-overs during meiosis. As for RH mapping, the underlying intuition is that the
further apart two markers are, the more likely it is that a cross-over will occur in
between. Because cross-overs cannot be directly observed, the indirect observation of
allelic patterns in parents and children is used to estimate the genetic distance
between them. There is a long tradition of using EM in genetic mapping. This paper
will focus on RH mapping but we must mention that the improvements presented in this
paper have also been applied to genetic mapping with backcross pedigrees. Actually,
genetic and RH data nicely complement each other for ordering markers. Genetic data
leads to myopic ordering: sets of close markers cannot be reliably ordered because
usually no recombination can be observed between them. On the contrary, RH data leads
to hypermetropic ordering: sets of closely related markers can be reliably ordered but
distant groups are sometimes difficult to order because too many breaks occurred
between them. Dealing with unknowns is unavoidable in genetic mapping since markers
may be uninformative.

In either RH or genetic mapping, the most obvious computational barrier is the sheer
number of possible orders. For $n$ markers, there are $n!/2$ possible orders (as an
order and its reverse are equivalent), which is too large to search exhaustively, even
for moderate values of $n$. In the simplest case of error-free unknown-free data, it
has been observed by several authors that the MLE ordering problem is equivalent to
the famous traveling salesman problem, an NP-hard problem. The ordering techniques
used in existing packages range from branch and bound to local search techniques and
more or less greedy heuristics. In all cases, finding a good order requires a very
large number of MLE calls. In practice, the cost of EM execution is still too heavy to
make branch and bound or local search methods computationally usable on large data
sets, and the greedy heuristic approach remains among the most widely used in practice.
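To give a feeling for how fast the number of orders grows, the following minimal sketch evaluates the $n!/2$ count for a few values of $n$. The function name `distinct_orders` is illustrative only.

```python
from math import factorial


def distinct_orders(n: int) -> int:
    """Number of marker orders when an order and its reverse are equivalent."""
    if n < 2:
        return 1  # zero or one marker: a single (trivial) order
    return factorial(n) // 2


for n in (5, 10, 20):
    print(n, distinct_orders(n))
```

Already for 20 markers the count exceeds $10^{18}$, which is why exhaustive search is ruled out.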

In this paper, we show how the EM algorithm for RH/genetic mapping can be sped up
when it is applied to data sets where each observation is either completely
informative or completely uninformative. This is the case, e.g., for error-free haploid
RH data with unknowns or error-free backcross data with unknowns. In practice, for RH
mapping, it has been observed that “analyzing polyploid radiation hybrids as if they
were haploid does not compromise the ability to order markers”, which makes this
restriction to haploid data quite reasonable. For genetic mapping, most phase-known
data can be reduced (although with some possible loss of information) to backcross
data. In practice, we have applied it to diploid RH data and a complex animal (pig)
pedigree with very limited discrepancies (see Section 5) and a speed-up factor of two
orders of magnitude.

Interestingly, this boosted EM algorithm is especially adapted to the unknown/known
patterns that appear in multiple population/panel mapping when a single consensus map
is built: when two or more data sets are available for the same organism (and with a
similar resolution for RH), a possible way to build a consensus map is to merge the
data sets into one. Markers that are not typed in one of the data sets are marked as
unknown in that data set. In this case, we show that each iteration of the EM
algorithm may be in $O(n)$ instead of $O(nk)$, where $n$ is the number of markers and
$k$ the number of individuals typed.

These improved EM algorithms, along with several TSP-inspired ordering strategies for
both framework and comprehensive mapping, have been implemented in the free software
package CarthaGene, which allows for multiple population/panel mapping using either
shared or separate distance estimation for each pair of data sets. This allows, among
other things, for mixed genetic/RH mapping (with separate estimations of genetic and
RH distances), which nicely exploits the complementarity of genetic and RH data.



2 The EM Algorithm for RH Data 

In this section, we will explain how EM can be optimized to deal with haploid RH data
sets with unknowns. This optimization also applies to backcross genetic data sets with
missing data. It has been implemented in the genetic/RH mapping software CarthaGene
but has never been described in the literature before.

Suppose that genetic markers $M_1, \ldots, M_n$ are typed on $k$ radiation hybrids.
The observations for a given hybrid, given the marker order $(M_1, \ldots, M_n)$, can
be written as a vector $x = (x_1, \ldots, x_n)$ where $x_i = 1$ if the marker $M_i$ is
typed and present, $x_i = 0$ if the marker is typed and absent and $x_i = -$ if the
marker could not be reliably typed. Such unknowns are relatively rare in RH mapping
but are much more frequent in genetic mapping or in multiple population/panel
consensus mapping.

The probabilistic HMM model for generating each sequence $x$ in the case of error-free
haploid “equal retention” data is defined by one retention probability denoted $r$
(probability for a fragment to be retained) and $n - 1$ breakage probabilities denoted
$b_1, \ldots, b_{n-1}$ ($b_i$ is the probability of breakage between markers $M_i$ and
$M_{i+1}$). Breakage and retention are considered independent processes.

The structure of the HMM model for 4 markers ordered as $M_1, \ldots, M_4$ is sketched
as a weighted digraph $G = (V, E)$ in Figure 1. Vertices correspond to the possible
states of being respectively retained, missing or broken. An edge $(a, b) \in E$ that
connects two vertices $a$ and $b$ is simply weighted by the conditional probability of
reaching the state $b$ from the state $a$, noted $p(a, b)$. For example, if we assume
that $M_1$ is on a retained fragment, there is a probability $b_1$ that a new fragment
will start between $M_1$ and $M_2$ and a probability $(1 - b_1)$ that the fragment
remains unbroken. In the first case, the new fragment may either be retained ($r$) or
not ($1 - r$). In the second case, we know that $M_2$ is on the same retained
fragment. The EM algorithm [5] is the usual choice for parameter estimation in hidden
Markov models [11], where it is also known as the Forward/Backward or Baum-Welch
algorithm. This algorithm can be used to estimate the parameters $r, b_1, b_2, \ldots$
and to evaluate the likelihood of a map given the available evidence (a vector $x$ of
observations for each hybrid in the panel).

Fig. 1. A Graph Representation of the “Equal Retention Model” for RH Mapping
(columns: $M_1$, $M_2$, $M_3$, $M_4$).

If we consider one observation $x = (0, -, 0, 1)$ on a given hybrid, the graph can be
restricted to the partial graph of Figure 2 by removing vertices and edges which are
incompatible with the observation (dotted in the figure). Every path in this graph
corresponds to a possible reality. The path in bold corresponds to the case where
there has been a breakage between each pair of markers, the successive fragments being
missing, retained, missing and retained. If we define the probability of such a
source-sink path as the product of all the edge probabilities, then the sum of the
probabilities of all the paths that are compatible with the observation is precisely
its likelihood. Although there is a worst-case exponential number of such paths,
dynamic programming, embodied in the so-called “Forward” algorithm [11], may be used
to compute the likelihood of a single hybrid in $\Theta(n)$ time and space. For any
vertex $v$, if we note $P_l(v)$ the sum of the probabilities of all the paths that
exist in the graph from the source to the vertex $v$, we have the following recurrence
equation:

$$P_l(v) = \sum_{u \text{ s.t. } (u,v) \in E} P_l(u) \cdot p(u,v)$$

Fig. 2. The Graph Representation for $x = (0, -, 0, 1)$
(columns: $M_1$, $M_2$, $M_3$, $M_4$).

This simply says that in order to reach $v$ from the source, we must first reach a
vertex $u$ that is directly connected to $v$ (with probability $P_l(u)$), then go to
$v$ (with probability $p(u, v)$). We can sum up all these probabilities since they
correspond to an exhaustive list of distinct cases. This recurrence can simply be used
by initializing the probability $P_l$ of the source vertex to 1.0 and applying the
equation “Forward” (from left to right, using a topological ordering of the vertices).
Obviously, $P_l$ for the sink vertex is nothing but the likelihood of the hybrid. One
should note that the same idea can be exploited to compute, for each vertex $v$,
$P_r(v)$, the sum of the probabilities of all paths that connect $v$ to the sink
(simply reverse all edges and apply the “Forward” version).
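The “Forward” recurrence can be illustrated by the following minimal sketch. It rests on a simplifying assumption that is not in the text: the “broken” vertices of the graph are folded into the transition between two hidden states per marker (1 = retained, 0 = missing), which leaves the sum over paths unchanged. Unknown observations are encoded as `None`, and all function names are illustrative. A brute-force enumeration of all compatible paths is included only as a cross-check.

```python
from itertools import product


def transition(s, t, b, r):
    """Probability of going from state s to state t (1 = retained, 0 = missing)
    across an interval with breakage probability b and retention probability r."""
    new_fragment = b * (r if t == 1 else 1.0 - r)  # break, then a fresh fragment
    same_fragment = (1.0 - b) if s == t else 0.0   # no break: state is inherited
    return new_fragment + same_fragment


def compatible(state, obs):
    """An unknown observation (None) is compatible with both states."""
    return obs is None or obs == state


def forward_likelihood(x, b, r):
    """Likelihood of one hybrid x (entries 0, 1 or None), linear in len(x)."""
    # P_l for the two vertices (missing, retained) of the first marker
    p_left = [(r if s == 1 else 1.0 - r) if compatible(s, x[0]) else 0.0
              for s in (0, 1)]
    for i in range(1, len(x)):
        p_left = [
            sum(p_left[s] * transition(s, t, b[i - 1], r) for s in (0, 1))
            if compatible(t, x[i]) else 0.0
            for t in (0, 1)
        ]
    return sum(p_left)


def brute_force_likelihood(x, b, r):
    """Sum of the probabilities of all state paths compatible with x."""
    total = 0.0
    for states in product((0, 1), repeat=len(x)):
        if not all(compatible(s, o) for s, o in zip(states, x)):
            continue
        p = r if states[0] == 1 else 1.0 - r
        for i in range(1, len(x)):
            p *= transition(states[i - 1], states[i], b[i - 1], r)
        total += p
    return total


x = [0, None, 0, 1]  # the observation (0, -, 0, 1) used in the text
print(forward_likelihood(x, [0.2, 0.3, 0.4], 0.3))
```

The parameter values above are arbitrary and serve only to exercise the code.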

The EM algorithm is an iterative algorithm that starts from given initial values of
the parameters $r^0, b_1^0, b_2^0, \ldots$ and that goes repeatedly through two phases:

1. Expectation: for each hybrid $h \in H$, given the current value of the parameters,
the probability $P_h(u, v)$ that a path compatible with the observation $x$ for hybrid
$h$ uses edge $(u, v)$ can simply be computed by:

$$P_h(u,v) = P_l(u) \cdot p(u,v) \cdot P_r(v)$$

If, for a given parameter $p$ and a given hybrid $h$, we note $S_h^+(p)$ the set of all
edges weighted by $p$ and $S_h^-(p)$ the set of all edges weighted by $1 - p$, an
expected number of occurrences of the corresponding event can be computed by:

$$E(p) = \sum_{h \in H} \frac{\sum_{(u,v) \in S_h^+(p)} P_h(u,v)}{\sum_{(u,v) \in S_h^+(p) \cup S_h^-(p)} P_h(u,v)}$$

2. Maximization: the value of each parameter $p$ is updated by maximum likelihood
under the assumption of complete data by:

$$p^{i+1} = \frac{E(p)}{k}$$

It is known that EM will produce estimates of increasing likelihood until a local
maximum is reached. The usual choice is to stop iterating when the increase of
log-likelihood is lower than a given tolerance. Several iterations are usually needed
to reach a small tolerance, especially as the number of unknowns increases.
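The two phases can be sketched as follows for a two-state simplification of the model (1 = retained, 0 = missing, unknowns encoded as `None`). This is an illustration under stated assumptions, not the implementation described in the paper: the “broken” vertices are folded into the transitions, breakage probabilities are updated as expected break counts divided by $k$, and the retention probability is updated as the expected fraction of retained fragments, which is the complete-data maximum likelihood estimate for this simplified model. All names are hypothetical.

```python
import math


def ok(state, obs):
    """An untyped marker (None) is compatible with both hidden states."""
    return obs is None or obs == state


def start(state, r):
    """Probability that a fresh fragment is retained (1) or missing (0)."""
    return r if state == 1 else 1.0 - r


def em_iteration(hybrids, b, r):
    """One EM iteration; returns (new_b, new_r, log-likelihood at the input b, r)."""
    n, k = len(b) + 1, len(hybrids)
    breaks = [0.0] * (n - 1)    # expected number of breaks per interval
    retained = fragments = 0.0  # expected retained fragments / total fragments
    loglik = 0.0
    for x in hybrids:
        # Forward pass: fwd[i][s] plays the role of P_l.
        fwd = [[start(s, r) if ok(s, x[0]) else 0.0 for s in (0, 1)]]
        for i in range(1, n):
            prev = fwd[-1]
            fwd.append([
                ((1.0 - b[i - 1]) * prev[t] + b[i - 1] * start(t, r) * sum(prev))
                if ok(t, x[i]) else 0.0
                for t in (0, 1)
            ])
        lik = sum(fwd[-1])
        loglik += math.log(lik)
        # Backward pass: bwd[i][s] plays the role of P_r.
        bwd = [[1.0, 1.0] for _ in range(n)]
        for i in range(n - 2, -1, -1):
            # mass reaching marker i+1 through a fresh fragment (after a break)
            fresh = sum(start(t, r) * bwd[i + 1][t]
                        for t in (0, 1) if ok(t, x[i + 1]))
            bwd[i] = [
                b[i] * fresh
                + ((1.0 - b[i]) * bwd[i + 1][s] if ok(s, x[i + 1]) else 0.0)
                for s in (0, 1)
            ]
            # E phase: posterior probability of a break in interval i
            breaks[i] += b[i] * sum(fwd[i]) * fresh / lik
            if ok(1, x[i + 1]):
                retained += b[i] * sum(fwd[i]) * r * bwd[i + 1][1] / lik
        retained += fwd[0][1] * bwd[0][1] / lik  # first fragment retained
        fragments += 1.0                         # every hybrid has a first fragment
    fragments += sum(breaks)                     # each break starts a new fragment
    # M phase: complete-data maximum likelihood estimates
    return [e / k for e in breaks], retained / fragments, loglik
```

Calling `em_iteration` repeatedly and stopping when the returned log-likelihood increases by less than a chosen tolerance reproduces the stopping rule described above.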

Each of the forward, backward and $E(p)$ computations of the E phase successively
treats each pair of adjacent markers in constant time (there are at most 6 edges
between each pair of markers). These will be called “steps” in the sequel, with the
aim of getting a better idea of complexity than Landau’s notation can offer (and which
can anyway be derived from the number of steps needed since each step is constant
time). From the previous simplified presentation of EM, we can observe that each EM
iteration needs $3(n + 1)k$ steps since $n + 1$ steps are required for the Forward
phase, the Backward phase and the $E(p)$ computation phase. The M phase is in
$\Theta(n)$ only.

2.1 Speeding Up the E Phase 

To make the E phase more efficient, the idea is to summarize the data for all hybrids
in a more concise way in order to avoid, as far as possible, the $k$ factor in the
complexity of the E phase. The crucial property that is exploited is that when the
status of a locus is known (0 or 1), the probabilities $P_l$ and $P_r$ for the
corresponding “Retained” vertex are both equal to 0.0 or both equal to 1.0,
respectively (1.0 and 0.0, respectively, for the “Missing” vertex), independently of
the markers around.

Any given hybrid can therefore be decomposed into segments of successive loci of 3
different types, as illustrated in Figure 3.



Fig. 3. Decomposing Hybrid into Segments
(segment labels in the figure: dangling left, bounded, known, dangling right).



- dangling segments are either segments that start at the first locus and are all of
unknown status except the rightmost one (dangling left) or segments that end at the
last locus and are all of unknown status except the leftmost one (dangling right).

- known pairs are segments composed of a pair of adjacent loci which are both of
known status.

- bounded segments are segments that start and stop at loci of known status but
which are separated by loci of unknown status.
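The three segment types listed above can be extracted from a single hybrid with a short routine such as the following sketch. The encoding (`None` for an unknown status) and the names `decompose`, `dangling_left`, `known_pair`, `bounded` and `dangling_right` are illustrative choices, not taken from the paper.

```python
def decompose(x):
    """Split one hybrid (entries 0, 1 or None) into segments.

    Returns (kind, first_index, last_index) tuples with inclusive indices."""
    known = [i for i, v in enumerate(x) if v is not None]
    if not known:
        return []  # nothing typed: the hybrid carries no information
    segments = []
    if known[0] > 0:
        # unknown loci up to the first known one
        segments.append(("dangling_left", 0, known[0]))
    for a, c in zip(known, known[1:]):
        # consecutive known loci: adjacent (known pair) or separated (bounded)
        kind = "known_pair" if c == a + 1 else "bounded"
        segments.append((kind, a, c))
    if known[-1] < len(x) - 1:
        # unknown loci after the last known one
        segments.append(("dangling_right", known[-1], len(x) - 1))
    return segments


print(decompose([None, 1, None, None, 0, 0, 1, None]))
```

Adjacent segments share their boundary locus, which is harmless because the status of that locus is known.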

Given the hybrid data set $H$, it is possible to precompute for all pairs of markers
the number of known pairs of each type. This is done only once, when the data set is
loaded. For a given locus ordering, we can precompute in an $O(nk)$ phase the number
of dangling and bounded segments of each type that occur in the data set. Then the EM
algorithm may iterate and perform the E phase by computing expectations for each of
the cases and multiplying the results by the number of occurrences. The maximization
phase remains unchanged.
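The one-time precomputation of known-pair counts can be sketched as below. The dictionary layout and the name `known_pair_counts` are assumptions made for illustration; the counts are independent of any marker order, so they can be looked up for whichever two markers end up adjacent in a candidate order.

```python
from collections import Counter
from itertools import combinations


def known_pair_counts(hybrids):
    """counts[(i, j, a, c)] = number of hybrids in which markers i < j are both
    typed, with values a and c. Computed once when the data set is loaded."""
    counts = Counter()
    for x in hybrids:
        typed = [(i, v) for i, v in enumerate(x) if v is not None]
        for (i, a), (j, c) in combinations(typed, 2):
            counts[(i, j, a, c)] += 1
    return counts


counts = known_pair_counts([[1, 0, None], [1, 0, 1], [0, 0, 1]])
print(counts[(0, 1, 1, 0)])  # hybrids with marker 0 present and marker 1 absent
```

There are at most four value combinations per marker pair, which is where the 4 types per pair of adjacent markers come from.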

For known pairs, the expectation computation can be done in one step and there are at
most $4(n - 1)$ different types of pairs, which means $4(n - 1)$ steps are needed. For
all other segments, dangling or bounded, the expectation computation needs a number of
steps equal to the length of the segment. So, if we note $u$ the total number of
unknowns in the data set, the total length of these segments is less than $3u$ and the
expectation computation can be done in at most $9u$ steps. We get an overall number of
steps of $4(n - 1) + 9u$, which is usually very small compared to the $3(n + 1)k$
needed before. From an asymptotic point of view, this is still in $O(nk)$ because $u$
is in $O(nk)$, but it does improve things a lot in practice. Also, decomposing hybrids
into segments guarantees that repeated patterns occurring in two or more hybrids are
only processed once.
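As a rough numerical illustration of the two step counts, the sketch below evaluates them for 50 markers, 100 hybrids and 5% of unknowns, the values used for the simulated RH data in Section 4. The function names are illustrative, and the second figure is only the upper bound $4(n - 1) + 9u$, not a measured count.

```python
def steps_standard(n, k):
    """Steps per EM iteration for the plain E phase: 3(n + 1)k."""
    return 3 * (n + 1) * k


def steps_boosted_bound(n, u):
    """Upper bound on steps per EM iteration for the boosted E phase: 4(n - 1) + 9u."""
    return 4 * (n - 1) + 9 * u


n, k = 50, 100
u = n * k * 5 // 100  # 5% of the n*k observations are unknown
print(steps_standard(n, k), steps_boosted_bound(n, u))
```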

There is a specific important case where asymptotic complexity may improve. When
multiple population/panel consensus mapping is performed, a marker which is not typed
in a data set will be unknown for all the hybrids/individuals of that data set. This
may induce dangling and bounded segments that are shared by all the
hybrids/individuals of the data set but that will be processed only once at each EM
iteration. In this case, known pairs and all dangling/bounded segments induced by
data set merging will be handled in $O(n)$ time instead of $O(nk)$.
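The merging step that produces these shared unknown patterns can be sketched as follows. The representation (a list of marker names plus one row of 0/1/`None` values per hybrid) and the name `merge_panels` are assumptions made for the example.

```python
def merge_panels(markers_a, panel_a, markers_b, panel_b):
    """Merge two data sets; a marker not typed in a data set becomes unknown
    (None) for every hybrid/individual of that data set."""
    all_markers = list(markers_a) + [m for m in markers_b if m not in markers_a]

    def widen(markers, panel):
        col = {m: i for i, m in enumerate(markers)}
        return [[row[col[m]] if m in col else None for m in all_markers]
                for row in panel]

    return all_markers, widen(markers_a, panel_a) + widen(markers_b, panel_b)


markers, merged = merge_panels(["M1", "M2"], [[1, 0]],
                               ["M2", "M3"], [[0, 1], [1, 1]])
print(markers)
print(merged)
```

Every row coming from the same data set has `None` in the same columns, so the induced dangling and bounded segments are identical across those rows.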

3 The CarthaGene Package

These improved EM algorithms have been implemented in the CarthaGene package. Beyond
RH and backcross data, CarthaGene can also handle genetic intercross, recombinant
inbred lines and phase-known outbred data. Although perfectly able to handle single
population mapping, CarthaGene is oriented towards multiple population mapping: data
sets can be hierarchically “merged” and the user decides which data sets share (or
not) distance estimations. Marker orderings are always evaluated using a true
multipoint maximum likelihood criterion. Beyond multiple population genetic mapping or
multiple panel RH mapping, CarthaGene allows one to perform mixed genetic/RH mapping
by merging genetic and RH data and estimating genetic/RH distances separately using a
common locus order.

Considering locus ordering, CarthaGene offers a wide spectrum of tools for building
and validating both comprehensive and framework maps. Most of these tools have been
derived from the famous Traveling Salesman Problem (TSP) solving technology. From its
beginning, CarthaGene has exploited the analogy between backcross genetic mapping and
the TSP. As has been observed elsewhere, this analogy also exists for RH mapping.
CarthaGene has been implemented in C++ and has a Tcl/Tk based graphical interface. It
is available on the web at www-bia.inra.fr/T/CarthaGene. It runs on most Unix
platforms and Windows.

4 Empirical Evaluation of the Boosted EM Algorithm

To better evaluate the gains obtained by the improved EM algorithm, we have compared
it to the EM implementations available in the RHMAP RH mapping package and in the
MapMaker genetic mapping package. Since we only wanted to compare EM efficiency and
not locus ordering efficiency, all tests have been done by estimating the
log-likelihood of a fixed locus ordering using the same convergence tolerance and the
same starting point, on a Pentium II 400 MHz machine running Linux, using GNU
compilers (respectively g77, gcc and g++ for RHMAP, MapMaker and CarthaGene), on 1000
EM calls.

Simulated data sets have been generated using the underlying probabilistic model 
both for backcross and RH data: 

- the first tests have been done on single panel/population data. The RH data uses 50
markers evenly distributed on a 10 Ray long chromosome, 100 hybrids and 5% of
unknowns. The genetic data uses 50 markers evenly distributed on a 1 Morgan long
chromosome, 200 individuals and 25% of unknowns. This corresponds to typical
framework mapping situations.

- the second tests have been done by merging two panels/populations. The data sets
have been generated using the same parameters except that each data set shares half
of the markers with the other data set. When a marker is not typed in a
panel/population, the corresponding hybrid/individual is marked as unknown.¹



For radiation hybrid data, the speed-up exceeds one order of magnitude. More modest
improvements are reached on genetic data. These improvements may reduce days of
computation time to hours and enable the use of more sophisticated ordering
techniques, without any approximation being made. These numbers still leave room for
improvement since CarthaGene does not exploit the strategy of precomputing known pairs
once and for all but recomputes them at each EM call.

¹ Note that for RH data, this situation corresponds to the merging of two panels that
have been irradiated using a similar level of radiation. If this is not the case, one
should rather perform separate distance estimations per panel. More complex models,
using proportional distances, are available in RHMAP but are not compatible with the
use of the EM algorithm.

5 Application to Real Genetic and Radiation Hybrid Data 

The improvements obtained apply only to haploid RH data and phase known backcross 
genetic data. This could be considered as quite restrictive. In this section we show how 
RH polyploid data and complex genetic data can be reduced to these cases and evaluate 
the impact of these reductions. This also illustrates how genetic and RH data can be 
mixed in order to exploit the different resolutions for marker ordering. 

Considering genetic data, the USDA porcine reference pedigree consists of a
two-generation backcross population of 10 parents (2 males from a White composite line
and 8 F1 females) and 94 progeny. The eight F1 females were obtained after mating
White composite females with Duroc, Fengjing, Meishan or Minzhu boars. To reduce this
to backcross data, phases have been set to the most probable phase, identified using
Chrompic from the CRIMAP package. Then the data is encoded as two backcrosses (one for
the paternal allele and one for the maternal allele). The original data set can be
processed by CRIMAP (but not by MapMaker).

Considering RH data, the IMpRH panel consists of 118 radiation hybrid clones produced
after irradiation of porcine cells at 7000 rads. Using this panel, a first generation
whole genome radiation map has been established using 757 markers, including 699
markers already mapped on the genetic map.

These two data sets have been merged and a framework consensus order built for
chromosome 1 using the buildfw command of CarthaGene. The resulting order contains 38
markers. This order has been validated using simulated annealing and other usual
techniques (local permutations, swapping of markers, ...) and could not be improved,
the second best order having a log-likelihood 4.48 below the best. The simulated
annealing step alone took a few hours on a Pentium II 400 MHz and could probably not
have been done in reasonable time using standard EM algorithms.

Using CarthaGene instead of CRIMAP/RHMAP, we made the assumptions that using a haploid
model on diploid data, fixing the phase and considering outbreds as double backcross
does not change differences in likelihood too much. In order to check these
assumptions, we compared CarthaGene to CRIMAP and RHMAP applied separately on the 4
best orders identified by CarthaGene. We used the CarthaGene haploid model, the RHMAP
haploid model and the RHMAP diploid model on the RH data alone. The following table
indicates the log-likelihoods obtained in each case and the differences in
log-likelihood with the best order. The results are consistent with our assumption.
The same comparison was done using the CRIMAP outbred model and the CarthaGene
backcross model on the derived double backcross data. The results obtained are again
consistent with our assumption. Note that the important change in log-likelihood is
not surprising: fixing phases brings in information, while double backcross projection
removes some information. The important thing is that differences in log-likelihood
are not affected.





                Order 1    Order 2    Order 3    Order 4

CRIMAP          -366.54    -370.26    -366.54    -379.22
CarthaGene      -422.29    -426.01    -422.29    -434.97

Dif. CRIMAP        0.00      -3.72       0.00     -12.68
Dif. Cartha.       0.00      -3.72       0.00     -12.68
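The “Dif.” rows can be recomputed from the log-likelihood rows, as in the short sketch below, which uses only the values of the table. The function name `differences` is illustrative.

```python
crimap = [-366.54, -370.26, -366.54, -379.22]
carthagene = [-422.29, -426.01, -422.29, -434.97]


def differences(loglik):
    """Difference in log-likelihood between each order and the best order."""
    best = max(loglik)
    return [round(v - best, 2) for v in loglik]


print(differences(crimap))
print(differences(carthagene))
```

Although the absolute log-likelihoods differ by about 56 units between the two programs, the differences relative to the best order coincide.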



We completed this test by a larger comparison, using more orders, and it appears that
the differences in log-likelihood are well conserved: a discrepancy between the two
differences greater than 1.0 was observed only for orders whose log-likelihood was
very far from the best one (more than 10 LOD). CarthaGene can thus be used to build
framework and comprehensive maps, integrating genetic and RH maps in reasonable time.
For better final distances between markers, one can simply reestimate them with the
RHMAP diploid model and CRIMAP.



6 Conclusion 



We introduced a simple improvement for the EM algorithm in the framework of HMM-based
RH and genetic maximum likelihood estimation. One of the limitations of this
improvement is that it can only deal with observations that either completely
determine the hidden states or leave the hidden state undetermined (unknown). It can
therefore not be extended, e.g., to handle intercross genetic data, diploid RH data or
models that explicitly represent typing errors. However, as we experienced on real
data, these restrictions can easily be dealt with.

This is especially attractive considering that our application to haploid radiation
hybrid and backcross genetic mapping shows that the boosted EM algorithms can lead to
speed-ups of more than one order of magnitude compared to the traditional EM approach,
without any loss in accuracy. Ordering techniques such as simulated annealing or tabu
search require a large number of calls to the evaluation function. The efficiency of
the boosted EM algorithm makes the application of such approaches practical, even in
the presence of unknowns. This is crucial for multiple population/panel mapping as it
has been done, e.g., in [15].



Abstract. We describe the theoretical basis of an approach using microarrays of
probes and libraries of BACs to construct maps of the probes, by assigning relative
locations to the probes along the genome. The method depends on several hybridization
experiments: in each experiment, we sample (with replacement) a large library of BACs
to select a small collection of BACs for hybridization with the probe arrays. The
resulting data can be used to assign a local distance metric relating the arrayed
probes, and then to position the probes with respect to each other. The method is
shown to be capable of achieving surprisingly high accuracy within individual contigs
and with less than 100 microarray hybridization experiments even when the probes and
clones number about $10^5$, thus involving potentially around $10^{10}$ individual
hybridizations.

This approach is not dependent upon existing BAC contig information, and so 
should be particularly useful in the application to previously uncharacterized 
genomes. Nevertheless, the method may be used to independently validate a BAC 
contig map or a minimal tiling path obtained by intensive genomic sequence de- 
termination. 

We provide a detailed probabilistic analysis to characterize the outcome of a sin- 
gle hybridization experiment and what information can be garnered about the 
physical distance between any pair of probes. This analysis then leads to a for- 
mulation of a likelihood optimization problem whose solution leads to the rel- 
ative probe locations. After reformulating the optimization problem in a graph- 
theoretic setting and by exploiting the underlying probabilistic structure, we de- 
velop an efficient approximation algorithm for our original problem. We have 
implemented the algorithm and conducted several experiments for varied sets of 
parameters. Our empirical results are highly promising and are reported here as 
well. We also explore how the probabilistic analysis and algorithmic efficiency 
issues affect the design of the underlying biochemical experiments. 



1 Introduction 

Genetics depends upon genomic maps. The ultimate maps are complete nucleotide
sequences of the organism together with a description of the transcription units. Such
maps in various degrees of completion exist for many of the microbial organisms,
yeasts, worms, flies, and now humans. Short of this, genetically or physically mapped
collections of objects derived from the genome under study are still of immense
utility, and are often precursors to the development of complete sequence maps. These
objects may be markers of any sort, DNA probes, and genomic inserts in cloning vectors.

We have been exploring the use of microarrays to assist in the development of ge- 
nomic maps. We report here one such mapping algorithm, and explore its foundation 
using computer simulations and mathematical treatment. The algorithm uses unordered 
probes that are microarrayed and hybridized to an organized sampling of arrayed but 
unordered members of libraries of large insert genomic clones. 

In the following we assume some knowledge of genome organization, DNA hybridization,
repetitive DNA, gene duplication, and the common forms of microarray experiments. In
the proposed experimental setting, one sample at a time is hybridized to microarrayed
probes, and hybridization is measured as an absolute quantity. We assume probes are of
zero dimension, that is, of negligible length compared to the length of the large
genomic insert clones. Most importantly, we assume that the hybridization signal of a
probe reflects its inclusion in one or more large genomic insert clones present in the
sample, with negligible background hybridization. Our analysis is general enough to
include the effects of other sources of error. The novelty of the results reported
here is in their ability to deal with ambiguities, an inevitable consequence of the
use of massive parallelism in microarrays involving many probes and many clones.
Similar algorithms are reported in the literature, but they assume only the knowledge
of clone-probe inclusion information for every such combination and suggest different
algorithms that do not exploit the underlying statistical structure.

One important application of our method is in measuring gene copy number in genomic
DNA. Such techniques will eventually have direct application to the analysis of
somatic mutations in tumors and inherited spontaneous germline mutations in organisms
when those mutations result in gene amplification or deletion. In contrast, low
signal-to-noise ratios, due to the high complexity of genomic DNA, make other
approaches such as the direct application of standard DNA microarray methods highly
problematic.

2 Related Literature 

The problem of physical mapping and sequencing using hybridization is relatively well
studied. The general problem of physical mapping has been shown to be NP-complete. An
approach based on the traveling salesman problem (TSP) in the absence of statistics
has also been proposed. The problem formalism used in this paper will be similar to
the foundational work in [1,2,3,5,7,11,12,14]. Our method extends the previous results
by devising efficient algorithms as well as biochemical experiments capable of
achieving higher resolution of probe placement within contigs. The Matrix-To-Line
problem has been suggested as the model problem for determining Radiation Hybrid maps.
Probe mapping using BACs is slightly different in that pairwise distances for probes
far away cannot be resolved directly using BACs. Our design is general in that the
inputs are modeled as random variables with known statistics determined a priori by
our experimental designs, chosen appropriately for a range of applications. We also
provide estimated probabilities of correctness for the map we produce. In this sense,
this paper invokes an experimental optimization as recommended in earlier work.

3 Mathematical Definitions

Given a set of $P$ probes listed as $\{p_1, p_2, \ldots, p_P\}$ and contained in some
contiguous segment of the genome, we define a probe map to be a pair of sequences,
ordering $= \{p_{\pi(1)}, p_{\pi(2)}, \ldots, p_{\pi(P)}\}$ and position
$= \{x_1, x_2, \ldots, x_P\}$. The position sequence infers the positions of the
probes and the ordering sequence is determined by the permutation¹ $\pi \in S_P$ that
sorts the given list of probes by position.

However, the underlying correct position of each probe remains unknown. We infer probe
maps approximating the correct positions as well as possible from an experimental set
of data which is stochastic. Experimental data sets are represented by graphs: given a
set of probes $\{p_1, p_2, \ldots, p_P\}$, let $V$ be the set of indices. Then a
pairwise distance graph is an undirected graph $\mathcal{G} = (V, E)$,
$E \subseteq V \times V$, where each edge $e_{ij}$ maps to a distance $d_{ij}$ between
probe $i$ and probe $j$.
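These two definitions translate directly into small data structures, as in the sketch below. The names `ordering` and `distance_graph` are illustrative, and the `max_dist` cut-off (edges exist only for probes close enough to be related) is an assumption added for the example rather than part of the definition.

```python
def ordering(positions):
    """Permutation pi that sorts probe indices by position."""
    return sorted(range(len(positions)), key=positions.__getitem__)


def distance_graph(positions, max_dist):
    """Undirected pairwise distance graph: edge (i, j), i < j, maps to d_ij.

    Only pairs no further apart than max_dist (a hypothetical cut-off) get an edge."""
    graph = {}
    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            d = abs(positions[i] - positions[j])
            if d <= max_dist:
                graph[(i, j)] = d
    return graph


positions = [7.5, 1.0, 4.0]
print(ordering(positions))
print(distance_graph(positions, max_dist=4.0))
```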

We model various experimental errors arising from the hybridization experiment 
used to measure probe to probe distance. With the model we can understand the dis- 
tribution of pairwise distance graphs as a random variable. Under certain parameters 
we can implement Bayes formula to build a Maximum Likelihood Estimator (MLE) 
for probe map reconstruction. With the MLE established we attempt to optimize the 
computation involved for practical implementation. 



3.1 Experimental Procedure 

Consider a genome represented by the interval $[0, G]$. Take $P$ random short substrings
(about 200 bp each) that appear uniquely on the genome. Represent these strings as points
$\{x_1, \ldots, x_P\}$. Assume that the probes are i.i.d. with uniform distribution over
the interval $[0, G]$. Let $S$ be a collection of intervals of the genome, each of length $L$
(usually ranging from a few 100 kb to Mb). Suppose the left-hand points of the intervals
of $S$ are i.i.d. uniform random variables over the interval $[0, G]$. Take a small subset
of intervals $S' \subset S$, even in size and chosen randomly from $S$. Divide $S'$ randomly
into two equal-size disjoint subsets, $S' = S'_R \cup S'_G$, where $R$ indicates a red color class
and $G$ indicates a green color class. Now specify any point $x$ in $[0, G]$ and consider the
possible associations between $x$ and the intervals in $S'$:

- $x$ is not covered by any interval in $S'$.

- $x$ is covered by at least one interval of $S'_R$ but no intervals of $S'_G$.

- $x$ is covered by at least one interval of $S'_G$ but no intervals of $S'_R$.

- $x$ is covered by at least one interval of $S'_R$ and at least one interval of $S'_G$.

If we perform a sequence of $M$ such experiments, then for each $x$ we get a sequence of
$M$ outcomes, represented as a color vector of length $M$. We are interested in observing
sequences of such outcomes on the set $\{x_1, \ldots, x_P\}$.
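The four possible associations listed above can be simulated directly. The following is a minimal sketch, not the authors' code: the function names `color_outcome` and `color_vector` are hypothetical, and as a simplifying assumption each experiment draws `K` fresh interval start points uniformly on $[0, G]$ (rather than sampling from a fixed pool $S$) and colors the first half red and the second half green.

```python
import random


def color_outcome(x, red_starts, green_starts, L):
    """Classify point x as 'B', 'R', 'G' or 'Y' given interval left endpoints."""
    red = any(s <= x < s + L for s in red_starts)
    green = any(s <= x < s + L for s in green_starts)
    if red and green:
        return "Y"  # covered by both color classes
    if red:
        return "R"
    if green:
        return "G"
    return "B"  # not covered at all


def color_vector(x, G, L, K, M, rng):
    """Run M independent experiments with K intervals each (K assumed even)."""
    outcomes = []
    for _ in range(M):
        # Assumption: starts are i.i.d. uniform, so the first half can serve
        # as the red class and the second half as the green class.
        starts = [rng.uniform(0, G) for _ in range(K)]
        half = K // 2
        outcomes.append(color_outcome(x, starts[:half], starts[half:], L))
    return "".join(outcomes)
```

Calling `color_vector(1000.0, 32000.0, 160.0, 130, 20, random.Random(0))` would return a string of 20 characters over the alphabet B, R, G, Y for the point at position 1000.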

For DNA the short substrings can be produced with the use of restriction enzymes,
or synthesized as oligos. The collection of covering intervals may be provided by a
BAC or YAC clone library. The division of a random sample taken from the clone
library may be done with phosphorescent molecules added to the DNA and visible
with a laser scanner. Hybridization microarrays allow us to observe such an outcome
sequence for each of the 100,000 probes in a constant amount of time.

$^1$ We denote the permutation group on $P$ indices as $S_P$.

Consider an example with the human genome. To make a set of human oligo probes we may
use restriction enzymes to cut out $P$ probe substrings of size 200 bp to 1200 bp from the
genome and choose a low complexity representation (LCR) as discussed in [?]. We
may arrange for a sequence of $M$ random samples from the BAC library; suppose each
sample has $K$ BACs and coverage $c = KL/G$. Samples are then randomly partitioned into
two color classes, $\Sigma = \{R, G\}$, and then hybridized to a microarray arrayed with $P$
probes. If we pick one probe $p_i$, then the possible outcomes for one experiment are:

- $p_i$ hybridizes to zero BACs. We say the outcome is 'B' (blank).

- $p_i$ hybridizes to at least one red BAC and zero green BACs. We say the outcome is
'R' (red).

- $p_i$ hybridizes to at least one green BAC and zero red BACs. We say the outcome is
'G' (green).

- $p_i$ hybridizes to at least one green BAC and at least one red BAC. We say the
outcome is 'Y' (yellow).

We call these events $i_B$, $i_R$, $i_G$, and $i_Y$ respectively. We use $M$ random samples to
complete the full experiment. The parameter domain for the full experiment is $(P, L, K, M)$,
where $P$ is the number of probes, $L$ is the average length of the genomic material used
(for BACs, $L = 160$ kb), $K$ is the sample size, and $M$ is the number of samples. The
output is a color sequence for each probe. The sequence corresponding to probe $p_j$ is
$S_j = (s_{j,k})_{k=1}^{M}$ with $s_{j,k} \in \{B, R, G, Y\}$.

How the Distances Are Measured. With the resulting color sequences $S_j$ we can
compute the pairwise Hamming distance. Let

$H_{ij}$ = # places where $S_i$ and $S_j$ differ,

$C_{ij}$ = # places where $S_i$ and $S_j$ are the same but $S_i \neq B$,

$T_{ij}$ = # places where $S_i$ and $S_j$ are both $B$.

The Hamming distance defines a distance metric on the set of probes.
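As an illustration of the three counts just defined, here is a small sketch that computes them from two color sequences given as strings over B, R, G, Y. The function name `pairwise_counts` is a hypothetical helper introduced for this example only.

```python
def pairwise_counts(s_i, s_j):
    """Return (H, C, T) for two color sequences of equal length.

    H: positions where the sequences differ (Hamming distance),
    C: positions where they agree on a non-blank color,
    T: positions where both are blank ('B').
    """
    if len(s_i) != len(s_j):
        raise ValueError("color sequences must have the same length")
    H = sum(a != b for a, b in zip(s_i, s_j))
    C = sum(a == b and a != "B" for a, b in zip(s_i, s_j))
    T = sum(a == b == "B" for a, b in zip(s_i, s_j))
    return H, C, T
```

For the sequences `"BRGYB"` and `"BRYYG"` the sketch gives `(2, 2, 1)`: two mismatches, two agreements on a color, and one shared blank, so that $H + C + T = M = 5$.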

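Lemma 3.1. Consider an experiment with parameters $(P, L, K, M)$ and $c = KL/G$. Let
$i$ and $j$ be arbitrary indices from the clone set, and let $x_{ij}$ be the actual distance (in number
of bases) separating probe $p_i$ from probe $p_j$ on the genome. Let $\tilde{x}_{ij} = \min\{x_{ij}, L\}$.
Then:

$$H_{ij} \sim \mathrm{Bin}\left(M,\; 2c\,e^{-c/2}\,\frac{\tilde{x}_{ij}}{L} + O\!\left(\left(\frac{\tilde{x}_{ij}}{L}\right)^2\right)\right)$$

$$C_{ij} \sim \mathrm{Bin}\left(M,\; 1 - e^{-c} + c\,(e^{-c} - 2e^{-c/2})\,\frac{\tilde{x}_{ij}}{L} + O\!\left(\left(\frac{\tilde{x}_{ij}}{L}\right)^2\right)\right)$$

$$T_{ij} \sim \mathrm{Bin}\left(M,\; e^{-c(1 + \tilde{x}_{ij}/L)}\right)$$

Proof. See appendix.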

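These computations for small $x$ lead to an accurate estimator:

Corollary 3.2. The estimator of $x_{ij}$ given by $\hat{x}_{ij} = \sigma^2 H_{ij}$, with
$\sigma^2 = \frac{L e^{c/2}}{2cM}$, is good in the sense that there are values of $c$ such that,
as $M \to \infty$, $\hat{x}_{ij}$ is approximately normally distributed with mean $x_{ij}$ and
variance $\sigma^2 x_{ij}$ if $x_{ij} < L$, and with mean $L$ and variance $\sigma^2 L$ if $x_{ij} \ge L$.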

Proof It is based on a standard approximation. We have developed Chernoff bounds to 
analyze tradeoff between parameters K (determining c), and M. For x < ^ one can 
show that for nearly any value of c the above convergence in distribution is significantly 
rapid w.r.t. M. □ 

We have developed an estimator of Xij given by Xij = ^ .+ 2 C ^ ^ 

estimator takes into account the variation of sample coverage over the genome. 



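$$f(d_{ij} = d \mid x_{ij}) = \begin{cases} \dfrac{1}{\sqrt{2\pi x_{ij}}\,\sigma}\; e^{-(d - x_{ij})^2 / 2\sigma^2 x_{ij}} & \text{if } x_{ij} < L, \\[1ex] \dfrac{1}{\sqrt{2\pi L}\,\sigma}\; e^{-(d - L)^2 / 2\sigma^2 L} & \text{if } x_{ij} \ge L. \end{cases}$$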



Lemma 3.3. The distribution for distance $d$ is a function of $x$ and is approximated by

$$f(d|x) = I_{0 \le x < L}\, \frac{e^{-(d-x)^2 / 2\sigma^2 x}}{\sqrt{2\pi x}\,\sigma} + I_{L \le x \le G}\, \frac{e^{-(d-L)^2 / 2\sigma^2 L}}{\sqrt{2\pi L}\,\sigma}.$$

Proof. Simple restatement of Corollary 3.2. □

Since we have assumed that any given probe is distributed uniformly at random over
the genome, the density function for the probe's position is:

$$f(x) = \frac{1}{G}.$$

Our next lemma is an application of Bayes' formula to compute $f(x|d)$ from $f(x)$
and $f(d|x)$ computed above.

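Lemma 3.4. If

$$f(d|x) = I_{0 \le x < L}\, \frac{e^{-(d-x)^2 / 2\sigma^2 x}}{\sqrt{2\pi x}\,\sigma} + I_{L \le x \le G}\, \frac{e^{-(d-L)^2 / 2\sigma^2 L}}{\sqrt{2\pi L}\,\sigma},$$

then

$$f(x|d) \approx I_{d < L}\, \frac{e^{-(x-d)^2 / 2\sigma^2 d}}{\sqrt{2\pi d}\,\sigma} + I_{d \ge L}\, I_{L \le x \le G}\, \frac{1}{G - L}.$$

Proof. See appendix. □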

With the conditional $f(x|d)$ we can now define the Maximum Likelihood Estimation
problem:

Given an arbitrary edge-weighted complete pairwise distance graph $\mathcal{G}$ on $P$ vertices
representing probes, with each edge $(i, j)$ labeled with $d_{ij}$, a sampled value of a random
variable with the distribution $f(d \mid |x_i - x_j|)$, we would like to choose an embedding of
$\mathcal{G}$ (or more precisely, an embedding of the vertices of $\mathcal{G}$) into the real line:

$$\{\hat{x}_1, \hat{x}_2, \ldots, \hat{x}_P\} \subset [0, G],$$

that maximizes a likelihood function $F(\hat{x}_1, \hat{x}_2, \ldots, \hat{x}_P \mid d_{ij} : i, j \in [1, P])$. By ignoring
the weak dependencies, we approximate $F$ as:

$$\prod_{1 \le i, j \le P} f(|\hat{x}_i - \hat{x}_j| \mid d_{ij}).$$

Hence, we can minimize a related cost function

$$-\sum_{1 \le i, j \le P} \ln f(|\hat{x}_i - \hat{x}_j| \mid d_{ij}).$$



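Lemma 3.5. The optimization problem of finding $\hat{x}_j$ to maximize the likelihood
$f(x_j \mid \{x_i : i < j\}, \{d_{i,j} : i < j\})$ is approximated by solving the following optimization problem:

$$\text{minimize} \sum_{i :\, i < j} W_{ij}\,(|x_i - x_j| - d_{ij})^2,$$

where the $W_{ij}$'s are positive real-valued weight functions:

$$W_{ij} = \begin{cases} \dfrac{1}{2\sigma^2 d_{ij}} & \text{if } d_{ij} < L, \\[1ex] \epsilon & \text{otherwise,} \end{cases}$$

and $\epsilon$ is a small positive constant.

Proof.

$$-\ln f(x|d) = \begin{cases} \dfrac{(x - d)^2}{2\sigma^2 d} + \ln(\sqrt{2\pi d}\,\sigma) & \text{if } d < L; \\[1ex] \ln(G - L) & \text{otherwise.} \end{cases}$$

Hence, up to terms that do not depend on the positions,

$$-\sum_{1 \le i, j \le P} \ln f(|\hat{x}_i - \hat{x}_j| \mid d_{ij}) \approx \sum_{1 \le i, j \le P} W_{ij}\,(|x_i - x_j| - d_{ij})^2.$$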



Note that e = „ / , < „ P < ^Tvir as ctm being the maximum variance is 

bounded by (G — L) □. 
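To make the weighted least-squares objective concrete, the following sketch evaluates it for a candidate placement. It is an illustration under stated assumptions, not the authors' implementation: the names `weight` and `placement_cost` are hypothetical, distances are passed as a dictionary keyed by index pairs, and pairs with $d_{ij} \ge L$ (or a zero distance) simply receive the small weight `eps`, which defaults to 0 here.

```python
def weight(d, L, sigma2, eps=0.0):
    """Weight 1 / (2 sigma^2 d) for informative distances, eps otherwise."""
    if 0.0 < d < L:
        return 1.0 / (2.0 * sigma2 * d)
    return eps  # assumption: uninformative pairs get a negligible weight


def placement_cost(x, d, L, sigma2, eps=0.0):
    """Sum of W_ij * (|x_i - x_j| - d_ij)^2 over the measured pairs.

    x: list of candidate probe positions,
    d: dict mapping index pairs (i, j) to estimated distances d_ij.
    """
    total = 0.0
    for (i, j), d_ij in d.items():
        residual = abs(x[i] - x[j]) - d_ij
        total += weight(d_ij, L, sigma2, eps) * residual ** 2
    return total
```

With positions `[0, 10, 30]`, distances `{(0, 1): 10, (1, 2): 20, (0, 2): 25}`, `L = 100` and `sigma2 = 1`, only the pair (0, 2) has a nonzero residual (5), giving a cost of $25 / (2 \cdot 25) = 0.5$.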



3.2 Simple Algorithm 

The simplest algorithm to place probes proceeds as follows. Initially, every probe occurs
in just one singleton contig, and the relative position of a probe $\hat{x}_i$ in contig $C_i$ is
0. At any moment, two contigs $C_p = [\hat{x}_{p_1}, \hat{x}_{p_2}, \ldots, \hat{x}_{p_l}]$ and $C_q =
[\hat{x}_{q_1}, \hat{x}_{q_2}, \ldots, \hat{x}_{q_m}]$ may be considered for a "join" operation: the result is either a
failure to join the contigs $C_p$ and $C_q$ or a new contig $C_r$ containing the probes from
the constituent contigs. Without loss of generality, assume that $|C_p| \ge |C_q|$, and that
the probe corresponding to the right end of the first contig ($\hat{x}_{p_l}$) is closest to the left end
of the other contig ($\hat{x}_{q_1}$). That is, the estimated distance $d_{p_l,q_1}$ is smaller than all other
estimated distances: $d_{p_1,q_1}$, $d_{p_1,q_m}$, and $d_{p_l,q_m}$.

[Figure: contigs $C_p$ (probes $\hat{x}_{p_1}, \ldots, \hat{x}_{p_l}$) and $C_q$ (probes $\hat{x}_{q_1}, \ldots, \hat{x}_{q_m}$) drawn on a line, with the estimated distance $d_{p_l,q_1}$ marked between them.]



Let $0 < \theta < 1$ be a parameter to be explored further later, and $L' = \theta L < L$. If
$d_{p_l,q_1} > L'$ then the join operation fails. Otherwise, the join operation succeeds with
the probes of $C_p$ placed to the left of the probes of $C_q$, with all the relative positions of
the probes of each contig left undisturbed. We estimate the distance between the
probes in $C_p$ and the probe $\hat{x}_{q_1}$ by minimizing the function:

$$\text{minimize} \sum_{i \in \{p_1, \ldots, p_l\} :\, d_{i,q_1} < L'} \frac{(|x - \hat{x}_i| - d_{i,q_1})^2}{2\sigma^2 d_{i,q_1}},$$

where the $\hat{x}_i$'s ($i \in \{p_1, \ldots, p_l\}$) are fixed by the locations assigned in the contig $C_p$. Thus,
taking the derivative of the expression above with respect to $x$ and equating it to zero,
we see that the optimal location for $\hat{x}_{q_1}$ in $C_r$ is

$$d^* = \max\left[\hat{x}_{p_l},\; \frac{\sum_{i \in \{p_1, \ldots, p_l\} :\, d_{i,q_1} < L'} (\hat{x}_i + d_{i,q_1}) / \sigma^2 d_{i,q_1}}{\sum_{i \in \{p_1, \ldots, p_l\} :\, d_{i,q_1} < L'} 1 / \sigma^2 d_{i,q_1}}\right].$$
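The closed-form location for the join can be sketched in a few lines. This is a hedged illustration rather than the authors' code: `join_location` and `shift_contig` are hypothetical names, `cp_positions` holds the positions of the probes of $C_p$ from left to right, and `dist_to_q1` holds the estimated distances from each of those probes to the leftmost probe of $C_q$. The fallback used when no distance is strictly below $L'$ is an assumption of this sketch.

```python
def join_location(cp_positions, dist_to_q1, l_prime, sigma2):
    """Return d*, the location of the first probe of C_q, or None on failure."""
    if dist_to_q1[-1] > l_prime:
        return None  # right end of C_p is too far from left end of C_q
    num = 0.0
    den = 0.0
    for x_i, d_i in zip(cp_positions, dist_to_q1):
        if 0.0 < d_i < l_prime:
            w = 1.0 / (sigma2 * d_i)  # inverse-variance weight
            num += w * (x_i + d_i)
            den += w
    if den == 0.0:
        # Assumed fallback: use the greedy (nearest-neighbor) placement.
        return cp_positions[-1] + dist_to_q1[-1]
    # Never place the new probe to the left of the right end of C_p.
    return max(cp_positions[-1], num / den)


def shift_contig(cq_positions, d_star):
    """Shift the relative positions of C_q by d* inside the joined contig."""
    return [d_star + x for x in cq_positions]
```

For example, with $C_p$ at positions `[0.0, 50.0]` and distances `[70.0, 20.0]` to the first probe of $C_q$, both measurements agree and the sketch places that probe at 70.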



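Once the location of $\hat{x}_{q_1}$ is determined in $C_r$ at $d^*$, the locations of all other probes of
$C_q$ in the new contig $C_r$ are computed by shifting them by the value $d^*$. Thus

$$C_r = [\hat{x}_{r_1}, \ldots, \hat{x}_{r_l}, \hat{x}_{r_{l+1}}, \ldots, \hat{x}_{r_{l+m}}],$$

where $r_i = p_i$ and $\hat{x}_{r_i} = \hat{x}_{p_i}$ for $1 \le i \le l$; $r_{l+i} = q_i$ and $\hat{x}_{r_{l+i}} = d^* + \hat{x}_{q_i}$
for $1 \le i \le m$. Note that when the join succeeds, the distance between the pair of
consecutive probes $\hat{x}_{r_l}$ and $\hat{x}_{r_{l+1}}$ satisfies

$$0 \le \hat{x}_{r_{l+1}} - \hat{x}_{r_l} \le L',$$

and the distances between all other consecutive pairs are exactly the same as what they
were in the original constituent contigs. Thus, in any contig, the distance between every
pair of consecutive probes takes a value between 0 and $L'$. Note that one may further
simplify the distance computation by considering only the $k$ nearest neighbors of $\hat{x}_{q_1}$
from the contig $C_p$: namely, $\hat{x}_{p_{l-k+1}}, \ldots, \hat{x}_{p_l}$.

$$d^*_k = \max\left[\hat{x}_{p_l},\; \frac{\sum_{i \in \{p_{l-k+1}, \ldots, p_l\} :\, d_{i,q_1} < L'} (\hat{x}_i + d_{i,q_1}) / \sigma^2 d_{i,q_1}}{\sum_{i \in \{p_{l-k+1}, \ldots, p_l\} :\, d_{i,q_1} < L'} 1 / \sigma^2 d_{i,q_1}}\right]$$

In the greediest version of the algorithm $k = 1$ and

$$d^*_1 = \hat{x}_{p_l} + d_{p_l,q_1},$$

as one ignores all other distance measurements.

At any point we can also improve the distances in a contig by running an "adjust"
operation on a contig $C_p$ with respect to a probe $\hat{x}_{p_j}$, where

$$C_p = [\hat{x}_{p_1}, \ldots, \hat{x}_{p_{j-1}}, \hat{x}_{p_j}, \hat{x}_{p_{j+1}}, \ldots, \hat{x}_{p_l}].$$

[Figure: the "adjust" operation moves probe $\hat{x}_{p_j}$ within the contig spanning $\hat{x}_{p_1}$ to $\hat{x}_{p_l}$.]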



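We achieve this by minimizing the following cost function:

$$\text{minimize} \sum_{i \in \{p_1, \ldots, p_l\} \setminus \{p_j\} :\, d_{i,p_j} < L'} \frac{(|x_{p_j} - \hat{x}_i| - d_{i,p_j})^2}{2\sigma^2 d_{i,p_j}},$$

where the $\hat{x}_i$'s ($i \in \{p_1, \ldots, p_l\} \setminus \{p_j\}$) are fixed by the locations assigned in the contig $C_p$.
Let:

$$I_1 = \{i_1 \in \{p_1, \ldots, p_{j-1}\} : d_{i_1,p_j} < L'\}$$

$$I_2 = \{i_2 \in \{p_{j+1}, \ldots, p_l\} : d_{i_2,p_j} < L'\}$$

$$x^* = \frac{\sum_{i_1 \in I_1} (\hat{x}_{i_1} + d_{i_1,p_j}) / \sigma^2 d_{i_1,p_j} + \sum_{i_2 \in I_2} (\hat{x}_{i_2} - d_{i_2,p_j}) / \sigma^2 d_{i_2,p_j}}{\sum_{i \in I_1 \cup I_2} 1 / \sigma^2 d_{i,p_j}}$$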

At this point, if $x^* \neq \hat{x}_{p_j}$, then the new position of the probe $\hat{x}_{p_j}$ in the contig $C_p$ is
$x^*$. As before, one can use various approximate versions of the update rule, where only
$k$ probes from the left and $k$ probes from the right are considered; in the greediest
version only the two nearest neighbors are considered. Note that the "adjust" operation
always improves the quadratic cost function of the contig locally, and since the cost is positive
valued and bounded away from zero, the iterative improvement operations terminate.
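The "adjust" update can be sketched as a weighted average of the positions suggested by the left and right neighbors. The code below is an illustrative assumption-laden sketch: `adjust_position` is a hypothetical name, `positions` lists the probes of one contig from left to right, and `dist[i]` is the estimated distance between probe `i` and the probe being adjusted. It does not check that the new position preserves the left-to-right order of the contig.

```python
def adjust_position(positions, j, dist, l_prime, sigma2):
    """Return the adjusted position x* of probe j within its contig."""
    num = 0.0
    den = 0.0
    for i, x_i in enumerate(positions):
        if i == j:
            continue
        d = dist[i]
        if not (0.0 < d < l_prime):
            continue  # ignore missing or uninformative distances
        w = 1.0 / (sigma2 * d)
        # Probes to the left suggest x_i + d; probes to the right, x_i - d.
        target = x_i + d if i < j else x_i - d
        num += w * target
        den += w
    if den == 0.0:
        return positions[j]  # nothing to adjust against
    return num / den
```

With positions `[0.0, 12.0, 30.0]`, `j = 1` and distances `[10.0, 0.0, 20.0]`, both neighbors suggest position 10, so the middle probe moves from 12 to 10.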



4 Implementation of the k-Neighbor Algorithm

INPUT

The input domain is a probe set $V$ and a symmetric positive real-valued distance weight
matrix $D \in \mathbb{R}_+^{P \times P}$, where $P = |V|$.

PRE-PROCESS

Construct a graph $\mathcal{G}' = (V, E')$, where $E' = \{e_k = (x_i, x_j) \mid d_{ij} < L'\}$. The edge
set of the graph $\mathcal{G}'$ is sorted into increasing order as follows: $e_1, e_2, \ldots, e_Q$, with
$Q = |E'|$, such that for any two edges $e_{k_1} = [x_{i_1}, x_{j_1}]$ and $e_{k_2} = [x_{i_2}, x_{j_2}]$, if $k_1 < k_2$
then $d_{i_1,j_1} \le d_{i_2,j_2}$. $\mathcal{G}'$ can be constructed in $O(|V|^2)$ time, and its edges can be sorted
in $O(|E'| \log |V|)$ time. In a simpler version of the algorithm it will suffice to sort the
edges into an "approximate" increasing order by a parameter $H_{ij}$ (related to $d_{ij}$) that
takes values between 0 and $M$. Such a simplification would result in an algorithm with
$O(|E'| \log M)$ runtime.

MAIN ALGORITHM 

Data-structure: Contigs are maintained in a modified union-find structure designed to
encode a collection of disjoint unordered sets of probes which may be merged at any
time. Union-find supports two operations, union and find [?]: union merges two sets
into one larger set, and find identifies the set an element is in. At any instant, a contig is
represented by the following:

- Doubly linked list of probes giving left and right neighbor with estimated consecu- 
tive neighbor distances. 

- Boundary probes: each contig has a reference to left and right most probes. 

In the $k$th step of the algorithm consider edge $e_k = [x_i, x_j]$: if find($x_i$) and find($x_j$)
are in distinct contigs $C_p$ and $C_q$, then join $C_p$ and $C_q$, and update a single
distance-to-neighbor entry in one of the contigs.

At the termination of this phase of the algorithm, one may repeatedly choose a 
random probe in a randomly chosen contig and apply an “adjust” operation. 

OUTPUT 

A collection of probe contigs with probe positions relative to the anchoring probe for 
that contig. 
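The main loop can be sketched for the greediest case ($k = 1$). The following is a simplified illustration, not the authors' implementation: `build_contigs` is a hypothetical name, contigs are stored as Python lists at the union-find root (so a merge costs time linear in the contig size, unlike the doubly linked list described above), a join is attempted only when both endpoints of the edge are boundary probes, and contigs are reversed as needed so that the join happens right end to left end.

```python
def build_contigs(num_probes, dist, l_prime):
    """Greedy (k = 1) contig construction from pairwise distances.

    dist: dict mapping probe index pairs (i, j) to estimated distances.
    Returns a list of contigs, each a list of (probe, relative position).
    """
    parent = list(range(num_probes))
    order = {p: [p] for p in range(num_probes)}  # root -> probes, left to right
    pos = {p: [0.0] for p in range(num_probes)}  # root -> relative positions

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path halving
            p = parent[p]
        return p

    def reverse(root):
        span = pos[root][-1]
        order[root].reverse()
        pos[root] = [span - x for x in reversed(pos[root])]

    # Pre-process: keep edges shorter than L' and sort them by distance.
    edges = sorted((d, i, j) for (i, j), d in dist.items() if d < l_prime)
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri == rj:
            continue  # already in the same contig
        if i not in (order[ri][0], order[ri][-1]):
            continue  # i is not a boundary probe
        if j not in (order[rj][0], order[rj][-1]):
            continue  # j is not a boundary probe
        if order[ri][-1] != i:
            reverse(ri)  # make i the right end of its contig
        if order[rj][0] != j:
            reverse(rj)  # make j the left end of its contig
        offset = pos[ri][-1] + d  # greedy placement of j
        order[ri].extend(order[rj])
        pos[ri].extend(offset + x for x in pos[rj])
        parent[rj] = ri  # union
        del order[rj], pos[rj]
    return [list(zip(order[r], pos[r])) for r in order]
```

The design choice to store whole lists at the root keeps the sketch short; a production version would follow the linked-list or balanced-tree representation discussed in the text to obtain the stated time bounds.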

4.1 Time Complexity 

First we estimate the time complexity of the main algorithm implementing the $k$-neighbor
version. For each $e \in E'$ there are two find operations. The number of union
operations cannot exceed the number of probes $P = |V|$, as every successful join operation
leading to a union operation involves a boundary vertex of a contig. Any vertex during
its lifetime can appear at most twice as a boundary vertex of a contig taking part in
a successful join operation. The time cost of a single find operation is at most $\gamma(P)$,
where $\gamma$ is the inverse of Ackermann's function. Hence the time cost of all union-find
operations is at most $O(|E'| \gamma(P))$. The join operation, on the other hand, requires
running the $k$-neighbor optimization routine, which is done at a cost of $O(k)$. Thus the main
algorithm has a worst-case time complexity of:

$$O(|E'| \gamma(|V|) + k|V|)$$

The full algorithm, including preprocessing, has time complexity:

$$O(|E'| \log(|V|) + |V|^2)$$

In a slightly more robust version the contigs may be represented by a dynamic balanced
binary search tree which admits find and implant operations. Each operation has a worst-case
time complexity of $O(\log(|V|))$. Thus, after summing over all $|E'|$ operations, the
worst-case runtime for the main algorithm is:

$$O(|E'| \log(|V|) + k|V|)$$

and for the full algorithm is:

$$O(|E'| \log(|V|) + |V|^2)$$



[Figure: mean of d(x) over 100 samples (left) and variance of d(x) over 100 samples (right), with BAC length 160 kb.]





5 What Do the Simulations Tell? 

5.1 Simulation: Observed Distance 

The sample mean and variance of the distance function are computed with a simple
simulation done in silico. BACs are 160 kb in length; we generate 1,200 BACs and
place them randomly on a genome of size G = 32,000 kb. This gives a 6x BAC set.
In this experiment 100 random points are chosen on the genome, and for each point we
compute the Hamming distance to points 10, 20, 30, ..., 300 kb to the right
on the genome. Color sequences are computed by using 20 samples of 130 randomly
chosen BACs, of which half are likely to be red and the other half green.

5.2 Simulation: Full Experiment 

Below we describe an in-silico experiment for a problem with 150 probes. On a genome
of size 5,000 kb we randomly place 150 probes; their positions are graphed as a monotone
function of the probe index. Next we construct a population of 500 randomly placed
BACs. From the population we repeat a sampling experiment using a sample size of 32
BACs: 16 are colored red and 16 are colored green. Each sample is hybridized in silico
to the probe set. Here we assume a perfect hybridization, so there are no cross-hybridizations
or failures in hybridization associated with the experiment. We repeat the sample
experiment 130 times. This produces the observed distance matrix, whose distribution
we modeled earlier. This is the input for the algorithm presented in this paper. In the
distance vs. observed data plot we see that using a large M = 130 (suggested by the
Chernoff bounds) has its benefits in cutting down the rate of false positives. The
observed distance matrix is input into the 10-neighbor algorithm (with parameter θ) without
the use of the adjust operation; the result is 7 contigs. The order within contigs had five
mistakes. We look at the 4th contig and plot the relative error in probe placement.



[Figure panels, top row: probe positions; pair-wise probe distance; observed distance; distance vs. observed. Bottom row: inferred probe positions; inferred contig structure; inferred order given contig order; difference in relative positions for largest contig.]





6 Future Work 

The more robust variation of the algorithm based on a dynamically balanced binary 
search tree will be studied in more detail. A comparison with Traveling Salesman 
heuristics and an investigation of an underlying relation to the heat equation will show 
why this algorithm works well. We will work on a probabilistic analysis for the statis- 
tics of contigs. A model incorporating failure in hybridization and cross hybridization 
will be developed. We are able to prove that, if errors are not systematic, then a slight
modification of the Chernoff bounds presented here can be applied to ensure the same
results. We shall also consider the choice of probes to limit the cross-hybridization
error and a choice of melting points to further decrease experimental
noise. A set of experimental designs will be presented for working biologists. More
extensive simulations and results on real experiments shall report the progress of what
appears to be a promising algorithm.

Appendix

A Proof of Lemma 3.1

Lemma A1. $H_{ij} \sim \mathrm{Bin}(M, \frac{2c}{L} e^{-c/2} x + O(x^2))$, $C_{ij} \sim \mathrm{Bin}(M, 1 - e^{-c} + \frac{c}{L}(e^{-c} - 2e^{-c/2})x + O(x^2))$,
$T_{ij} \sim \mathrm{Bin}(M, e^{-c(1 + x/L)})$, with parameters $(P, L, K, M)$ as
above and $c = KL/G$; $i, j$ are arbitrary indices from the clone set and $x$ is the actual
distance, in number of bases, separating the probe positions on the genome.

Proof. Since the $M$ samples are done independently, the proof reduces to showing that
when $M = 1$ the probabilities are Bernoulli with the respective parameters. Let us define
events $T = (i_B \wedge j_B)$, $C = ((i_R \wedge j_R) \vee (i_G \wedge j_G) \vee (i_Y \wedge j_Y))$, and $H = (\neg T \wedge \neg C)$.

Given a set of $K$ BACs on a genome $[0, G]$, the probability that none start in an
interval of length $l$ is $(1 - l/G)^K \approx e^{-\alpha l}$, where $\alpha = K/G$.

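Shown below is a diagram that is helpful in computing the probabilities for events
$C$, $H$, $T$ when $x < L$. The heavy dark bar labeled a represents a set of BACs which
covers probe $p_i$ but not $p_j$; the bar labeled b represents a set of BACs that covers probes
$p_i$ and $p_j$; finally, the bar labeled c represents a set of BACs that covers $p_j$ but not $p_i$.

[Figure: probes $p_i$ and $p_j$ at distance $x$ on the genome, with bars a, b (of length $L - x$), and c marking the start regions of BACs covering $p_i$ only, both probes, and $p_j$ only.]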



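Hence we derive, with $\alpha_R$ and $\alpha_G$ the start-point rates of the red and green BACs:

$$P(T \mid x < L) = \exp(-(\alpha_R + \alpha_G)(L + x))$$

$$P(i_R \wedge j_R \mid x < L) = e^{-\alpha_G(L+x)}\{1 - 2e^{-\alpha_R L} + e^{-\alpha_R(L+x)}\}$$

$$P(i_G \wedge j_G \mid x < L) = e^{-\alpha_R(L+x)}\{1 - 2e^{-\alpha_G L} + e^{-\alpha_G(L+x)}\}$$

$$P(i_Y \wedge j_Y \mid x < L) = (1 - 2e^{-\alpha_R L} + e^{-\alpha_R(L+x)})(1 - 2e^{-\alpha_G L} + e^{-\alpha_G(L+x)})$$

$$P(C \mid x < L) = P(i_R \wedge j_R \mid x < L) + P(i_G \wedge j_G \mid x < L) + P(i_Y \wedge j_Y \mid x < L)$$

$$P(H \mid x < L) = 1 - [P(T \mid x < L) + P(C \mid x < L)]$$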



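When $x > L$ the probabilities are:

$$P(T \mid x > L) = \exp(-(\alpha_R + \alpha_G)(2L))$$

$$P(i_R \wedge j_R \mid x > L) = e^{-2\alpha_G L}\{(1 - e^{-\alpha_R L})^2\}$$

$$P(i_G \wedge j_G \mid x > L) = e^{-2\alpha_R L}\{(1 - e^{-\alpha_G L})^2\}$$

$$P(i_Y \wedge j_Y \mid x > L) = (1 - e^{-\alpha_R L})^2 (1 - e^{-\alpha_G L})^2$$

$$P(C \mid x > L) = P(i_R \wedge j_R \mid x > L) + P(i_G \wedge j_G \mid x > L) + P(i_Y \wedge j_Y \mid x > L)$$

$$P(H \mid x > L) = 1 - [P(T \mid x > L) + P(C \mid x > L)]$$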

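Because $\alpha_R = \alpha_G$, we have $\alpha_R L = \alpha_G L = \frac{c}{2} = \frac{KL}{2G}$. Let $q = q(x) = P(H)$ and
$p = p(x) = P(C)$. In general $q(x)$ and $p(x)$ are complicated functions of $x$; below we
derive a first-order approximation of $x(q)$ to be used as a biased estimator.

$$P(H) = \frac{2c}{L} e^{-c/2} x + O(x^2)$$

$$P(T) = e^{-c(1 + x/L)}$$

$$P(C) = 1 - e^{-c} + \frac{c}{L}(e^{-c} - 2e^{-c/2})x + O(x^2)$$

With independent sampling:

$$H_{ij} \sim \mathrm{Bin}\left(M,\; \frac{2c}{L} e^{-c/2} x + O(x^2)\right)$$

$$C_{ij} \sim \mathrm{Bin}\left(M,\; 1 - e^{-c} + \frac{c}{L}(e^{-c} - 2e^{-c/2})x + O(x^2)\right)$$

$$T_{ij} \sim \mathrm{Bin}\left(M,\; e^{-c(1 + x/L)}\right) \quad \square$$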



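B Proof of Lemma 3.4 Using Bayes' Formula

Lemma B1. If

$$f(d|x) = I_{0 \le x < L}\, \frac{e^{-(d-x)^2 / 2\sigma^2 x}}{\sqrt{2\pi x}\,\sigma} + I_{L \le x \le G}\, \frac{e^{-(d-L)^2 / 2\sigma^2 L}}{\sqrt{2\pi L}\,\sigma},$$

then

$$f(x|d) \approx I_{d < L}\, \frac{e^{-(x-d)^2 / 2\sigma^2 d}}{\sqrt{2\pi d}\,\sigma} + I_{d \ge L}\, I_{L \le x \le G}\, \frac{1}{G - L}.$$

Proof. By Bayes' formula,

$$f(x|d) = \frac{f(d|x)\, f(x)}{\int_0^G f(d|x)\, f(x)\, dx}, \qquad f(x) = \frac{1}{G},$$

with $f(d|x)$ as in the hypothesis of the lemma.

For small values of $\sigma$ the denominator in the above expression can be approximated
as follows$^2$:

$$f(d) = \frac{1}{G} \int_0^L \frac{e^{-(d-x)^2 / 2\sigma^2 x}}{\sqrt{2\pi x}\,\sigma}\, dx + \frac{G - L}{G}\, \frac{e^{-(d-L)^2 / 2\sigma^2 L}}{\sqrt{2\pi L}\,\sigma} \approx I_{d < L}\, \frac{1}{G} + \frac{G - L}{G}\, \delta_{d = L}.$$

Thus, we make further simplifying assumptions and choose the following likelihood
function:

$$f(x|d) \approx I_{d < L}\, \frac{e^{-(x-d)^2 / 2\sigma^2 d}}{\sqrt{2\pi d}\,\sigma} + I_{d \ge L}\, I_{L \le x \le G} \left(\frac{1}{G - L}\right). \quad \square$$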

C How Good Are the Results? 

C.1 False Positives, False Negatives

We treat the problem of false positives and false negatives with Chernoff's tail bounds.
We find upper bounds on the probability of getting a false positive or false negative in
terms of the parameters $\theta$, $M$, and $c = KL/G$, where $0 < \theta < 1$ and $L' = L\theta < L$.

A false positive is a pair of probes that appear to be close by the Hamming distance
but are actually far apart on the genome. We denote the event as:

$$\text{F.P.} = (d < L') \wedge (x > L)$$

A false negative is a pair of probes that appear to be far by the Hamming distance but
are actually close on the genome. We denote the event as:

$$\text{F.N.} = (x < L') \wedge (d > L)$$

In the following picture the volumes of data which are false positives and false negatives
are indicated by the squares marked F.P. and F.N. respectively.

$^2$ The Dirac delta function is a distribution defined by the equations
$\delta_{x=0} = 0$ if $x \neq 0$, and $\int \delta_{x=0}\, dx = 1$.

We develop a Chernoff bound to bound the probability that the volume of false
positive data is greater than a specified size.

The Chernoff bounds for a binomial distribution with parameters $(M, q)$ are given
by:

$$P(H > (1 + v)Mq) < \left(\frac{e^v}{(1 + v)^{(1+v)}}\right)^{Mq} \quad \text{with } v > 0,$$

$$P(H < \theta Mq) < e^{-Mq(1-\theta)^2 / 2} \quad \text{with } 0 < \theta < 1.$$


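Let $H(M)$ be the Hamming distance when $M$ phases are complete. Let $q(L) = P(H \mid x > L) \approx \frac{2c}{e^{c/2}} = \frac{L}{M\sigma^2}$. We start by noting equivalent events:

$$(d < \theta L \mid x > L) = (\sigma^2 H(M) < \theta L \mid x > L)$$

$$= \left(H(M) < \theta \frac{L}{\sigma^2} \,\Big|\, x > L\right)$$

$$\subset \left(H(M) < \theta \frac{2cM}{e^{c/2}}\right)$$

$$= (H(M) < \theta M q(L))$$

Using the Chernoff bound we have:

$$P(d < \theta L \mid x > L) \le P(H(M) < \theta M q(L)) < e^{-Mc(1-\theta)^2 / e^{c/2}}$$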
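For the false negatives we begin by noting that:

$$(d > L \mid x < \theta L) = (\sigma^2 H(x) > L \mid x < L')$$

$$= (\sigma^2 H(x) > (1 + v)L' \mid x < L') \quad \text{where } v = \left(\frac{1}{\theta} - 1\right)$$

$$= \left(H(x) > \frac{(1 + v)L'}{\sigma^2} \,\Big|\, x < L'\right)$$

$$\subset (H(x) > (1 + v) M q(x))$$

The last event inclusion is because:

$$(x < L') \Rightarrow \left(\frac{2cMx}{e^{c/2} L} < \frac{2cML'}{e^{c/2} L}\right) \Rightarrow \left(Mq(x) < \frac{L'}{\sigma^2}\right)$$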



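Applying the Chernoff bound we get:

$$P(\text{F.N.}) \le P(H > (1 + v) M q(x)) < \left(\frac{e^v}{(1 + v)^{(1+v)}}\right)^{Mq(x)}, \quad v = \frac{1}{\theta} - 1, \quad Mq(x) \approx \frac{2cMx}{e^{c/2} L}.$$

The Chernoff bounds are:

$$P(\text{F.P.}) < e^{-Mc(1-\theta)^2 / e^{c/2}}$$

$$P(\text{F.N.}) < \left(\frac{e^v}{(1 + v)^{(1+v)}}\right)^{Mq(x)}$$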



The Chernoff bounds for typical parameters are shown below.

[Figure panels: Chernoff F.P. bound contour; Chernoff F.N. bound contour; F.P. Chernoff upper bound, θ = .7; F.N. Chernoff upper bound, θ = .7]

Abstract. Stochastic models are commonly used in bioinformatics, e.g., hidden
Markov models for modeling sequence families or stochastic context-free
grammars for modeling RNA secondary structure formation. Comparing data is 
a common task in bioinformatics, and it is thus natural to consider how to com- 
pare stochastic models. In this paper we present the first study of the problem of 
comparing a hidden Markov model and a stochastic context-free grammar. We 
describe how to compute their co-emission — or collision — probability, i.e., the 
probability that they independently generate the same sequence. We also con- 
sider the related problem of finding a run through a hidden Markov model and 
derivation in a grammar that generate the same sequence and have maximal joint 
probability by a generalization of the CYK algorithm for parsing a sequence by a 
stochastic context-free grammar. We illustrate the methods by an experiment on 
RNA secondary structures. 



1 Introduction 

The basic chain-like structure of the key biomolecules, DNA, RNA, and proteins, al- 
lows an abstract view of these as strings, or sequences, over finite alphabets, obviously 
of finite length. Furthermore, these sequences are not completely random, but exhibit 
various kinds of structures in different contexts. E.g. a family of homologous proteins is 
likely to have similar amino acid residues in “equivalent” positions; an RNA sequence 
will have pairs of complementary subsequences to form base pairing helices. Hence, it 
is natural to consider applying models from formal language theory to model different 
classes of biological sequences. 

Though not completely random, biological sequences can still possess inherent
stochastic traits, e.g., due to mutations in a family of homologous sequences or a lack
of knowledge (and computing power) to correctly model all aspects of RNA secondary
structure formation. Thus, it is often better to use stochastic models giving a probability
distribution over all sequences, where a high probability reflects a sequence likely to
belong to the class of sequences being modeled, instead of formalisms only distinguishing
sequences as either belonging to the class being modeled or not. The two most widely
used grammatical models in bioinformatics are hidden Markov models [?]
and stochastic context-free grammars [?], though other models have also been
proposed [?]. These two types of stochastic models were originally developed as
tools for speech recognition (see [?]). One can identify hidden Markov models as
a stochastic version of regular languages and stochastic context-free grammars as a
stochastic version of context-free languages (see [?] for an introduction to formal
languages). A more in-depth treatment of biological uses of hidden Markov models and
stochastic context-free grammars can be found in [?, Chap. 3-6 and 9-10].

As stochastic models are commonly used to model families of biological sequences, 
and as a common task in bioinformatics is that of comparing data, it is natural to ask 
how to compare two stochastic models. In [?] we described how to compare two hidden
Markov models, by computing the co-emission (or collision) probability of the
probability distributions of the two models, i.e., the probability that the two models
independently generate the same sequence. Having the co-emission probability for a
pair of probability distributions as well as for each of the distributions with itself, it is
easy to compute the $L_2$- and the Hellinger-distance between the two distributions. In
this paper we study the problem of comparing a hidden Markov model and a stochastic 
context-free grammar. We develop recursions for the co-emission probability of the dis- 
tributions of the model and the grammar, recursions that lead to a set of quadratic equa- 
tions. Though quadratic equations are generally hard to solve, we show how to find an 
approximate solution by a simple iteration scheme. Furthermore, we show how to solve 
the equivalent maximization problem, the problem of finding a run through the hidden 
Markov model and derivation in the grammar that generate the same sequence and have 
maximal joint probability. This is in essence parsing the hidden Markov model by the 
grammar, so the algorithm can be viewed as a generalization of the CYK algorithm 
for parsing a sequence by a stochastic context-free grammar. Indeed, in most cases the 
complexity of our algorithm will be identical to the complexity of the CYK algorithm. 
Finally we discuss the undecidability of some natural extensions of our results. 

The structure of this paper is as follows. In Sect. 2 and 3 we briefly introduce hidden
Markov models and stochastic context-free grammars and the terminology we use.
In Sect. 4 we consider the problem of computing the co-emission probability of the
probability distributions of a hidden Markov model and a stochastic context-free grammar,
and in Sect. 5 we develop the algorithm for parsing a hidden Markov model by a
stochastic context-free grammar. In Sect. 6 we present an illustrative experiment and in
Sect. 7 we discuss the problems occurring when trying to extend the methods presented
in Sect. 4 and 5.

2 Hidden Markov Models

A hidden Markov model $M$ is a generative model consisting of $n$ states and $m$ transitions
between states, where each state is either silent or non-silent. The model generates
a string $s$ over a finite alphabet $\Sigma$ with probability $P_M(s)$ such that $\sum_{s \in \Sigma^*} P_M(s) = 1$,
i.e., $M$ describes a probability distribution over the set of finite strings over $\Sigma$. A hidden
Markov model is a left-right model if the states can be numbered $1, 2, \ldots, n$ such that
all transitions $i \to j$ from state $i$ to state $j$ satisfy $i \le j$.

A run in $M$ begins in a special start-state and continues from state to state according
to the state transition probabilities, where $a_{q,q'}$ is the probability of a transition from
state $q$ to $q'$, until a special end-state is reached. A state is either a silent or a non-silent
state. Each time a non-silent state is entered, a symbol is emitted according to
the symbol emission probabilities, where $e_{q,\sigma}$ is the probability of emitting symbol $\sigma \in \Sigma$
in state $q$. When entering a silent state, nothing is emitted. A run thus follows a
Markovian path $\pi = (\pi_0, \pi_1, \ldots, \pi_k)$ of states and generates a string $s \in \Sigma^*$ which
is the concatenation of the emitted symbols. The probability $P_M(s)$ of $M$ generating $s$
is the probability of following any path and generating $s$. The probability $P_M(\pi, s)$
of following path $\pi = (\pi_0, \pi_1, \ldots, \pi_k)$ and generating $s$ depends on the subsequence
$(\pi_{i_1}, \pi_{i_2}, \ldots, \pi_{i_\ell})$ of non-silent states on the path $\pi$. If the length of $s$ is different from
the number of non-silent states along $\pi$, the probability $P_M(\pi, s)$ is zero. Otherwise:

$$P_M(\pi, s) = P_M(\pi) \cdot P_M(s \mid \pi) = \prod_{i=1}^{k} a_{\pi_{i-1},\pi_i} \prod_{j=1}^{\ell} e_{\pi_{i_j}, s_j}$$

A partial run in $M$ is a run that starts in a state $p$ and ends in a state $q$, which are not
necessarily the special start- and end-states. To ease the presentation of the methods in
Sect. 4 we introduce the concept of a partial run from $p$ to $q$ semi-including $q$, meaning
that if $q$ is a non-silent state then no symbol has yet been emitted from $q$. The probability
of a partial run from $p$ to $q$ semi-including $q$ and generating $s$ is thus the probability
of generating $s$ on the path from $p$ to the immediate predecessor of $q$ and taking the
transition to $q$. Given a string $s$ and a model $M$, efficient algorithms for determining
$P_M(s)$ and the run in $M$ of maximal probability generating $s$ are known, see [?].
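The joint probability of a path and a string can be sketched as follows. This is an illustrative example rather than code from the paper: `path_probability` is a hypothetical name, transitions are stored in a dictionary keyed by state pairs, and a state is treated as silent exactly when it has no entry in the emission table. The start state is assumed to be silent, since it is never entered.

```python
def path_probability(path, s, trans, emit):
    """Probability of following `path` and emitting string `s`.

    trans: dict mapping (q, q2) to the transition probability,
    emit:  dict mapping each non-silent state to {symbol: probability}.
    """
    # Non-silent states entered along the path (the start state is skipped).
    emitting = [q for q in path[1:] if q in emit]
    if len(emitting) != len(s):
        return 0.0  # length of s must match the number of non-silent states
    p = 1.0
    for q, q2 in zip(path, path[1:]):
        p *= trans.get((q, q2), 0.0)
    for q, symbol in zip(emitting, s):
        p *= emit[q].get(symbol, 0.0)
    return p
```

For a toy model with transitions start→A (1.0), A→B (0.5), B→end (1.0) and emissions A: a with 0.9, B: c with 0.8, the path (start, A, B, end) generates "ac" with probability $0.5 \cdot 0.9 \cdot 0.8 = 0.36$.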

3 Stochastic Context Free Grammars 

A context-free grammar $G$ describes a set of finite strings over a finite alphabet $\Sigma$, also
called a language over $\Sigma$. It consists of a set $V$ of non-terminals, a set $T = \Sigma \cup \{\varepsilon\}$ of
terminals, where $\varepsilon$ is the empty string, and a set $P$ of production rules $\alpha \to \beta$, where
$\alpha \in V$ and $\beta \in (V \cup T)^*$. A production rule $\alpha \to \beta$ means that $\alpha$ can be rewritten
to $\beta$ by applying the rule. A string $s \in \Sigma^*$ can be derived from a non-terminal $U$,
$U \Rightarrow s$, if $U$ can be rewritten to $s$ by a sequence of production rules; $s$ is in the language
described by $G$ if it can be derived from a special start non-terminal $S$. A derivation $D$
of $s$ in $G$ is a sequence of production rules which rewrites $S$ to $s$. A derivation of $s$ in $G$
is also called a parsing of $s$ in $G$.

A stochastic context-free grammar $G$ is a context-free grammar in which all production
rules $\alpha \to \beta$ are assigned probabilities $P_G(\alpha \to \beta)$ such that the sum of the
probabilities of all possible production rules from any given non-terminal is one. The
resulting stochastic grammar describes a probability distribution over the set of finite
strings over the finite alphabet $\Sigma$, where the probability $P_G(s)$ of deriving $s$ in $G$ (sometimes
also referred to as the probability of $G$ generating $s$) is the sum of the probabilities
of all possible derivations of $s$ from $S$. The probability of a derivation of $s$ in $G$, i.e.,
$P_G(D) = P_G(S \Rightarrow s)$, is the product of the probabilities of the production rules in the
sequence $D$ applied to derive $s$ from $S$.

Any (stochastic) context-free grammar can be transformed to an equivalent (stochastic)
context-free grammar in Chomsky normal form which describes the same language
but where the production rules can be split into two sets: a set P_n of non-terminal
production rules of the form U → XY, and a set P_t of terminal production rules of the
form U → σ and U → ε, where U, X and Y are non-terminals, σ is a symbol in Σ,
and ε is the empty string. For a stochastic context-free grammar G in Chomsky normal
form, we can compute P_G(s) by the inside algorithm, and the most likely parse of s
in G by the CYK algorithm, in time O(|P| · |s|^3) where |P| is the number of production
rules in G, i.e., |P_n| + |P_t|, and |s| is the length of s, see [?].
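
The two algorithms differ only in how alternative derivations are combined: the inside algorithm sums over them, while CYK keeps the best one. The sketch below makes this explicit. It is a minimal illustration, assuming a non-empty string and a grammar without ε-rules; the function name `parse_score` and the dictionary encoding of the rules are assumptions of this example.

```python
from collections import defaultdict

def parse_score(s, unary, binary, start, combine):
    """Inside probability (combine=sum) or CYK score (combine=max) of s.

    unary[(U, a)]     = P(U -> a)
    binary[(U, X, Y)] = P(U -> X Y)
    Assumes s is non-empty and the grammar has no epsilon rules.
    """
    n = len(s)
    # table[i][j][U] = score of deriving the substring s[i:j] from U
    table = [[defaultdict(float) for _ in range(n + 1)] for _ in range(n + 1)]
    for i, a in enumerate(s):
        for (U, b), p in unary.items():
            if a == b:
                table[i][i + 1][U] = p
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            for (U, X, Y), p in binary.items():
                for k in range(i + 1, j):  # split point between X and Y
                    term = p * table[i][k][X] * table[k][j][Y]
                    table[i][j][U] = combine((table[i][j][U], term))
    return table[0][n][start]

# Toy grammar: S -> S S (0.3) | a (0.7)
unary = {("S", "a"): 0.7}
binary = {("S", "S", "S"): 0.3}
inside = parse_score("aaa", unary, binary, "S", sum)  # sum over both parse trees
best = parse_score("aaa", unary, binary, "S", max)    # probability of one best tree
```

The three nested loops over substrings and split points, with the loop over rules inside, give the O(|P| · |s|^3) running time mentioned above.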

4 Comparing a SCFG and a HMM 

In this section we will consider the problem of comparing a stochastic context-free
grammar G (in Chomsky normal form) and a hidden Markov model M which generate
strings over the same alphabet Σ. More precisely, we will consider the problem of
computing the co-emission probability, C(G, M), of the probability distributions of G
and M over all strings, i.e., the quantity

\[
C(G, M) = \sum_{s \in \Sigma^*} P_G(s) \cdot P_M(s),
\]

which is similar to the definition of the co-emission probability of two hidden Markov
models in [?]. This quantity is also often referred to as the collision probability of
the probability distributions of G and M, as it is the probability that strings picked
at random according to the two probability distributions collide, i.e., are identical. We
initially assume that M is an acyclic hidden Markov model, i.e., a left-right hidden
Markov model where no states have self-loop transitions. By itself, this is not a very
interesting class of hidden Markov models, e.g., a model of this type cannot generate
strings of arbitrary length, but the ideas of our approach for computing the co-emission
probability for this class of models are also applicable to left-right and general hidden
Markov models.
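
To make the definition concrete, the following sketch evaluates the sum directly for two distributions that are given explicitly as finite tables. This is only feasible for toy cases; the distributions and the name `co_emission` are made up for illustration.

```python
def co_emission(dist_g, dist_m):
    """Sum over all strings s of P_G(s) * P_M(s) for explicitly listed distributions."""
    return sum(p * dist_m.get(s, 0.0) for s, p in dist_g.items())

# Hypothetical distributions over a handful of strings
dist_g = {"a": 0.5, "ab": 0.5}
dist_m = {"a": 0.2, "b": 0.8}
c = co_emission(dist_g, dist_m)  # only "a" has non-zero probability in both
```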

For acyclic hidden Markov models we can use an approach closely mimicking the
inside algorithm for computing the probability that a stochastic context-free grammar
generates a given string. In the inside algorithm, when computing the probability that
a string s is derived in a stochastic context-free grammar, we keep track of the
probability that a substring of s is derived from a non-terminal of the grammar. In our
algorithm for computing the co-emission probability of G and M we keep track of the
probability of deriving the same string from a non-terminal that is generated on a partial
run from a state p to a state q semi-including q in M. In our dynamic programming
based algorithm we maintain three arrays, A, B, and C, for the following purposes.

- A(p, q) will be the sum of the probabilities of all partial runs from p to q semi-
including q that do not emit any symbols, illustrated in Fig. 1(a), i.e., all states on
the partial runs, except possibly for q, are silent states.

- B(U, p, q) is the probability of independently deriving a single symbol string from
the non-terminal U and generating the same single symbol string on a partial run
from p to q semi-including q, illustrated in Fig. 1(b).

- C(U, p, q) is the probability of independently deriving a string from the non-terminal
U and generating the same string on a partial run from p to q semi-including
q, illustrated in Fig. 1(c).

(a) The A array holds the probabilities of getting from one state p to another state q in M
without emitting any symbols.
(b) The B array holds the probabilities of getting from one state p to another state q in
M while emitting only one symbol and at the same time generating the same symbol by a
terminal production rule for the non-terminal U in G.
(c) The C array holds the probabilities of getting from one state p to another state q in
M while emitting any string and at the same time generating the same string from a
non-terminal U in G.

Fig. 1. Illustration of the individual purposes and recursions of the three arrays used.
Hollow circles denote silent states, solid circles denote non-silent states, and hatched
circles denote states of any type. Squiggle arrows indicate partial runs of arbitrary length
and straight arrows indicate single transitions between states.

The purpose of the A and B arrays is to deal efficiently with partial runs consisting 
of silent states. As all symbols of a string are in a sense “non-silent”, this is the main new 
problem encountered when modifying the inside algorithm to “parse” acyclic hidden 
Markov models. 

The C array is similar to the array maintained in the inside algorithm. The array
of the inside algorithm tells us the probability of deriving any substring of the string
being parsed from any of the non-terminals of G. Similarly, the C array will tell us
the probability of independently generating a sequence on a partial run between any
pair of states and at the same time deriving it from any of the non-terminals of G. It is
evident that C(G, M) = C(S, start, end), where S is the start symbol of G and start
and end are the start- and end-states of M, as the co-emission probability of G and M
is the probability of deriving the same string from S that is generated on a (genuine)
run from start to end. This is assuming that the end-state of M is silent, so that any
partial run from start to end semi-including end in M is also a genuine run in M.

Having described the required arrays, we must specify recursions for computing
them, and argue that these recursions split the computations into ever smaller parts,
i.e., that the dependencies of the recursions are acyclic, to obtain a dynamic programming
algorithm computing C(G, M). The A array holds the probabilities of getting from
the state p to the state q in M along a path that only consists of silent states. In general
such a path can be broken down into the last transition of the path and a preceding path
only going through silent states. Hence, we obtain the following recursion where the
first case takes care of the initialization.

\[
A(p,q) =
\begin{cases}
1 & \text{if } p = q \\
\displaystyle\sum_{\substack{p \le r < q \\ r \text{ silent}}} A(p,r) \cdot a_{r,q} & \text{otherwise}
\end{cases}
\tag{1}
\]

The ordering of states referred to in the summation is any ordering consistent with the
(partial) ordering of states by the acyclic transition structure. One immediately observes
that each entry of the A array requires time at most O(n) to compute, where n is the
number of states in M. Thus the entire A array can be computed in time O(n^3). However,
one can observe that we actually only need to sum over states r with a transition to q in
the summation of (1). This observation reduces the time requirements to O(nm), where m
is the number of transitions in M, as each transition is part of O(n) of the above
recursions.
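
The recursion for the A array can be sketched as follows, assuming the states are numbered 0, ..., n-1 in an order consistent with the acyclic transition structure. The representation (`silent` as a list of booleans, `trans[r]` as a dictionary of outgoing transition probabilities) and the function name are assumptions of this example. Summing only over incoming transitions gives the O(nm) bound.

```python
def silent_path_probs(n, silent, trans):
    """A[p][q]: probability of reaching q from p through silent states only.

    States 0..n-1 are assumed to be topologically ordered; trans[r] maps each
    successor q > r to the transition probability a_{r,q}.
    """
    # incoming[q] lists (r, a_rq) for every transition r -> q
    incoming = [[] for _ in range(n)]
    for r, out in enumerate(trans):
        for q, a in out.items():
            incoming[q].append((r, a))
    A = [[0.0] * n for _ in range(n)]
    for p in range(n):
        A[p][p] = 1.0  # empty path
        for q in range(p + 1, n):
            # last transition r -> q, preceded by a silent path from p to r
            A[p][q] = sum(A[p][r] * a for r, a in incoming[q] if r >= p and silent[r])
    return A

# Hypothetical model: 0 (silent start), 1 (silent), 2 (non-silent), 3 (silent end)
silent = [True, True, False, True]
trans = [{1: 0.5, 2: 0.5}, {2: 0.4, 3: 0.6}, {3: 1.0}, {}]
A = silent_path_probs(4, silent, trans)
```

In this toy model A[0][2] collects both silent routes into the non-silent state 2, whereas A[2][3] is zero because state 2 itself is not silent.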

The B array holds the probabilities of getting from the state p to the state q in M by
a partial run generating exactly one symbol, and at the same time generating the same
symbol from U by a terminal production rule. In general we can partition the partial run
into an initial path consisting only of silent states and the remaining partial run starting
with the non-silent state emitting the symbol. Hence, we obtain the following recursion
where "initialization" is handled by paying special attention to the special case where
the initial path of silent states is empty. Here e_p^σ denotes the probability of emitting
the symbol σ in state p, which is zero if p is silent.

\[
B(U,p,q) = \sum_{(U \to \sigma) \in G} P_G(U \to \sigma) \cdot e_p^{\sigma} \cdot
\sum_{p < r \le q} a_{p,r} \cdot A(r,q)
\; + \sum_{\substack{p < r < q \\ r \text{ non-silent}}} A(p,r) \cdot B(U,r,q)
\tag{2}
\]

To find the time requirements for computing the B array one observes that each
non-terminal of G occurs in O(n^2) entries. Hence, each terminal production rule of G is
part of O(n^2) of the above recursions. Each of these recursions requires time O(n) to
compute. Thus, computing the B array requires time O(|P_t| · n^3).

The C array should hold the probabilities of getting from the state p to the state q
in M by a partial run generating any string, and at the same time generating the same
string from a non-terminal U in G. The purpose of the A and B arrays is to handle the
special cases where at most a single symbol is emitted on a partial run from p to q. In
these cases we do not really recurse on a non-terminal U from G but either ignore G
completely or only consider terminal production rules. In the general case where a string
with more than one symbol is generated, we need to apply a non-terminal production
rule U → XY from G. Part of the string is then derived from X and part of the string
is derived from Y. This leads to the following recursion.

\[
C(U,p,q) = P_G(U \to \varepsilon) \cdot A(p,q) + B(U,p,q)
+ \sum_{(U \to XY) \in G} P_G(U \to XY) \cdot
\sum_{\substack{p < r < q \\ r \text{ non-silent}}} C(X,p,r) \cdot C(Y,r,q)
\tag{3}
\]

The reason for requiring r to be a non-silent state in the last sum is to ensure a unique
decomposition of the partial runs from p to q. If we allowed r to be silent we would
erroneously include some partial runs from p to q several times in the summation. As
we did with the computation of the B array, we can observe that each non-terminal
production rule is part of O(n^2) of the above recursions, and that each recursion requires
time O(n) to compute. Hence, we obtain total time requirements of O(|P_n| · n^3) for
computing the C array. Adding the time requirements for computing the three arrays
leads to overall time requirements of O(|P| · n^3) for computing the co-emission
probability of G and M. This corresponds to the time requirements of O(|P| · |s|^3)
for parsing a string s by the grammar G, as stated in the preceding section.
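
The following sketch puts the three arrays together for an acyclic model. It follows one reading of the recursions for A, B and C in which the B recursion adds the case of an empty initial silent path (the term with the emission probability of p) to a sum of A(p, r) · B(U, r, q) over non-silent r. It also assumes that no non-terminal with an ε-rule occurs on the right-hand side of a production rule, so that every split state r lies strictly between p and q. The data layout and the name `co_emission_acyclic` are assumptions of this example.

```python
def co_emission_acyclic(n, silent, trans, emit, eps, unary, binary, start):
    """Co-emission probability C(start, 0, n-1) for an acyclic HMM.

    States 0..n-1 are topologically ordered, state 0 is the start state and
    state n-1 the silent end state. trans[r] and emit[r] are dictionaries;
    eps[U], unary[(U, a)] and binary[(U, X, Y)] hold rule probabilities.
    """
    nts = {U for U, _ in unary} | {s for rule in binary for s in rule} | set(eps)
    A = [[0.0] * n for _ in range(n)]
    B = {U: [[0.0] * n for _ in range(n)] for U in nts}
    C = {U: [[0.0] * n for _ in range(n)] for U in nts}
    # Pairs (p, q) are handled by increasing distance, so every entry used on
    # a right-hand side has already been computed.
    for d in range(n):
        for p in range(n - d):
            q = p + d
            A[p][q] = 1.0 if p == q else sum(
                A[p][r] * trans[r].get(q, 0.0) for r in range(p, q) if silent[r])
            # leave p by one transition, then reach q silently
            leave = sum(a * A[r][q] for r, a in trans[p].items() if p < r <= q)
            for U in nts:
                match = sum(pr * emit[p].get(a, 0.0)
                            for (V, a), pr in unary.items() if V == U)
                B[U][p][q] = match * leave + sum(
                    A[p][r] * B[U][r][q] for r in range(p + 1, q) if not silent[r])
                C[U][p][q] = eps.get(U, 0.0) * A[p][q] + B[U][p][q]
            for (U, X, Y), pr in binary.items():
                C[U][p][q] += pr * sum(
                    C[X][p][r] * C[Y][r][q] for r in range(p + 1, q) if not silent[r])
    return C[start][0][n - 1]

# Hypothetical example: G generates "aa" and "ab" with probability 0.5 each.
unary = {("X", "a"): 1.0, ("Y", "a"): 0.5, ("Y", "b"): 0.5}
binary = {("S", "X", "Y"): 1.0}
silent = [True, False, True, False, True]
trans = [{1: 1.0}, {2: 0.5, 3: 0.5}, {3: 0.8, 4: 0.2}, {4: 1.0}, {}]
emit = [{}, {"a": 0.9, "b": 0.1}, {}, {"a": 0.3, "b": 0.7}, {}]
c = co_emission_acyclic(5, silent, trans, emit, {}, unary, binary, "S")
```

In this toy model P_M("aa") = 0.243 and P_M("ab") = 0.567, so the expected co-emission probability is 0.5 · 0.243 + 0.5 · 0.567 = 0.405. For brevity the sketch computes A with the simple O(n^3) summation rather than the O(nm) variant.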

As previously stated, in order for these recursions to be used for a dynamic programming
algorithm, we need to argue that the recursions only refer to elements that in
some sense are smaller, i.e., that no recursion for any of the entries of the three arrays
depends cyclically on itself. But this is an easy observation, as all pairs of states indexing
an array on the right-hand side of the recursions are closer in the ordering of states than
the pair of states indexing the array on the left-hand side of the recursions.

The point of always recursing by smaller elements is exactly where we run into
problems when trying to extend the above method to allow cycles in M. Even when
only allowing the self-loops of left-right models this problem crops up. We can easily
modify (1), (2), and (3) to remain valid as equations for the entries of the A, B and C
arrays. All that is needed is to change some of the strict inequalities in the summation
boundaries to include equality. More specifically, if we assume that only non-silent
states can have self-loop transitions (self-loop transitions for silent states are rather
meaningless and can easily be eliminated), we need to change (2) and (3) to

\[
B(U,p,q) = \sum_{(U \to \sigma) \in G} P_G(U \to \sigma) \cdot e_p^{\sigma} \cdot
\sum_{p \le r \le q} a_{p,r} \cdot A(r,q)
\; + \sum_{\substack{p < r \le q \\ r \text{ non-silent}}} A(p,r) \cdot B(U,r,q),
\tag{4}
\]

\[
C(U,p,q) = P_G(U \to \varepsilon) \cdot A(p,q) + B(U,p,q)
+ \sum_{(U \to XY) \in G} P_G(U \to XY) \cdot
\sum_{\substack{p \le r \le q \\ r \text{ non-silent}}} C(X,p,r) \cdot C(Y,r,q).
\tag{5}
\]

For the B array, (4) still refers only to smaller elements: either to the A array or to an
entry of the B array where the two states are closer in the ordering, as r has to be strictly
larger than p in the ordering in the last sum. But for the C array we might choose r equal
to either p or q in the last sum. Hence, to compute C(U, p, q) we need to know C(X, p, q)
and C(Y, p, q), which might in turn depend on (or even be) C(U, p, q).

But, as stated above, (5) still holds as a set of equations for the entries of the C array.
For each pair of states, p and q, in M we thus have a system of equations with one
variable and one equation (and the restriction that all variables have to attain non-negative
values) for each non-terminal of G. Assume that we solve these systems in an order
where the distance between the pair of states in the ordering is increasing, i.e., we first
consider systems with p = q, then systems with q being the successor to p, etc. Then
most of the systems will be systems of linear equations. In the equation for a given
entry C(U, p, q) the only unknown quantities are the occurrences of C(X, p, q) and
C(Y, p, q) corresponding to production rules U → XY of G. These occurrences have
coefficients P_G(U → XY) · C(Y, q, q) and P_G(U → XY) · C(X, p, p), respectively,
coefficients that have known values if p < q. But if p = q the last sum of (5) will lead
to a number of terms of the form C(X, p, p) · C(Y, p, p), i.e., the system of equations
is quadratic. Hence, for each state with a self-loop transition we need to solve a system
of quadratic equations with one variable and one equation for each non-terminal of G.
General systems of quadratic equations are hard to solve, see [?], but the construction
proving this requires equations with all terms having coefficients with the same sign.
One can immediately observe that in a system of equations based on (5) the left-hand
side terms will have coefficients that have the opposite sign of the right-hand side terms.
Hence, the hardness proof does not relate to the system of quadratic equations we obtain
from (5). We have not been able to find any literature on algorithms solving systems of
the type derivable from (5). But as all the dependencies are positive we can approximate
a solution simply by initializing all entries to the terms depending only on the A and
B arrays and then iteratively updating each entry in turn. This process will converge to
the true solution. We conjecture that this convergence will be very rapid for all realistic
grammars and models.
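
The iterative scheme can be sketched for a generic system of the form x_U = c_U + Σ coeff · x_X · x_Y with non-negative constants and coefficients. This is an illustration of the idea under these assumptions, not the authors' implementation; the names `solve_by_iteration`, `const` and `quad` and the stopping rule are choices made for this example.

```python
from math import sqrt

def solve_by_iteration(const, quad, tol=1e-12, max_iter=10_000):
    """Iterate x_U <- const[U] + sum(coeff * x_X * x_Y) until changes fall below tol.

    const[U] : non-negative term that depends only on known quantities
    quad[U]  : list of (coeff, X, Y) with non-negative coeff
    """
    x = dict(const)  # start from the terms that do not depend on unknowns
    for _ in range(max_iter):
        delta = 0.0
        for U in x:
            new = const[U] + sum(c * x[X] * x[Y] for c, X, Y in quad.get(U, []))
            delta = max(delta, abs(new - x[U]))
            x[U] = new  # update each entry in turn
        if delta < tol:
            break
    return x

# One-variable example: x = 0.1 + 0.4 * x^2
sol = solve_by_iteration({"U": 0.1}, {"U": [(0.4, "U", "U")]})
expected = (1 - sqrt(0.84)) / 0.8  # smaller root of 0.4 x^2 - x + 0.1 = 0
```

Because all terms are non-negative, the iterates in this example increase monotonically from the starting value towards the smaller of the two roots.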

For general hidden Markov models we do not have an ordering of the states of
M. Hence, to modify the above recursions to hold for general hidden Markov models, the
ordering constraints on states should be removed from all the summations. When we
no longer have any ordering of the states, all entries might depend on each other.
Thus, we cannot separate the systems of equations for the entries of the C array into
independent blocks based on the pair of states indexing the entry of the array. This
means that we obtain just one aggregate system of quadratic equations with |V| · n^2
variables and equations for the entries of the C array, where |V| is the number of
non-terminals in G and n is the number of states in M. However, for the entries of the A and
B arrays we still only get systems of linear equations. Actually, the B array can even
be computed by simple dynamic programming.

5 Parsing an HMM by a SCFG 

Given a stochastic context-free grammar G modeling RNA secondary structures and 
a hidden Markov model M representing a family of RNA sequences, we can use the 
method of the previous section to determine the likelihood that the family has a common 
secondary structure. This being established, one might want to find the most likely 
structure. In other words, an equivalent to the CYK algorithm for finding the most 
likely parse of a string, but finding the most likely “parse” of a hidden Markov model, 
is desirable. Hence, we want to compute 

\[
\max \{ P_G(D) \cdot P_M(r) \mid D \text{ a derivation in } G,\ r \text{ a run in } M,
\text{ and } S \Rightarrow^{D} s_r \},
\tag{6}
\]

where s_r is the string generated by the run r, preferably in a way that allows us to easily
extract a derivation D and a run r witnessing the maximum.

The basic principles we will use to compute the value of (6) are similar to those
used for computing the co-emission probability in the previous section. Thus, the general
technique is to find the probabilities of optimal combinations of a derivation from a
non-terminal U in G and a partial run between a pair of states p and q in M yielding the
same string. Furthermore, this optimal combination is split into a pair of optimal
combinations of a derivation from a non-terminal and a partial run between states as
illustrated in Fig. 1(c). However, computing the maxima is easier than computing the sums.
The main reason for this is that we do not have to consider combinations of derivations and
partial runs of arbitrary lengths. The probability of an optimal combination for a
particular choice of a non-terminal and pair of states cannot depend cyclically on itself,
as the product of two or more probabilities can never be larger than any of these
probabilities. A similar principle is used in algorithms for finding shortest paths in graphs
with only positive edge weights, cf. [?, chapters 25-26]. Not surprisingly, the methods
described in this section will bear strong resemblances to such algorithms, though the
particular nature of the problem does prevent formulating it simply as a shortest path
problem in a well chosen graph.

The description of our method will employ three arrays A_max, B_max, and C_max,
similar to the A, B, and C arrays used in the previous section. The A_max(p, q) entries hold
the maximum probability of any partial run from state p to state q semi-including q not
emitting any symbols. But this is just the path of maximum weight in the graph defined
by the transitions not leaving a non-silent state in M. We can thus compute the A_max
array by standard all-pairs shortest paths graph algorithms in time O(n^2 log n + nm),
cf. [?, p. 550].
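
As an illustration, A_max can be computed with a max-product variant of the Floyd-Warshall scheme. Note that this is a simpler O(n^3) substitute chosen for brevity, not the O(n^2 log n + nm) algorithm referred to above. The representation and the name `max_silent_path_probs` are assumptions of this sketch.

```python
def max_silent_path_probs(n, silent, trans):
    """best[p][q]: maximum probability of a path from p to q on which every
    state except possibly q is silent (the empty path has probability 1)."""
    best = [[0.0] * n for _ in range(n)]
    for p in range(n):
        best[p][p] = 1.0
        if silent[p]:  # only transitions leaving silent states are usable
            for q, a in trans[p].items():
                best[p][q] = max(best[p][q], a)
    for k in range(n):
        if not silent[k]:
            continue  # intermediate states must be silent
        for p in range(n):
            for q in range(n):
                best[p][q] = max(best[p][q], best[p][k] * best[k][q])
    return best

# Hypothetical cyclic model: states 0 and 1 are silent, state 2 is non-silent
silent = [True, True, False]
trans = [{1: 0.5, 2: 0.3}, {0: 0.1, 2: 0.9}, {0: 1.0}]
a_max = max_silent_path_probs(3, silent, trans)
```

Since all transition probabilities are at most one, going around a cycle can never increase the product, which is why the shortest-path style computation is valid even for models with cycles.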

The B_max(U, p, q) entries hold the maximum probability of a combination of a
derivation consisting of only a terminal production rule U → σ and a partial run from
p to q generating a string consisting only of the symbol σ. This imposes a restriction on
the path from p to q preventing the use of standard graph algorithms. Still, we only need
to combine transitions from non-silent states with preceding and succeeding partial runs
not emitting any symbols, i.e., with entries of A_max. However, if we compute the B_max
entries directly from the equation

\[
B_{\max}(U,p,q) = \max_{\substack{(U \to \sigma) \in G \\ (r \to s) \in M}}
\left\{ P_G(U \to \sigma) \cdot A_{\max}(p,r) \cdot e_r^{\sigma} \cdot a_{r,s} \cdot A_{\max}(s,q) \right\}
\tag{7}
\]

we use time O(|P_t| · m · n^2). This could possibly be the dominating term in the overall
time requirements for computing the value of (6), as we will see later.

Hence, we will specify a more efficient way to compute the B_max(U, p, q) entries.
If p is a non-silent state the optimal choice of a preceding partial run not emitting any
symbols must be the empty run. Thus

\[
B_{\max}(U,p,q) = \max_{\substack{(U \to \sigma) \in G \\ (p \to r) \in M}}
\left\{ P_G(U \to \sigma) \cdot e_p^{\sigma} \cdot a_{p,r} \cdot A_{\max}(r,q) \right\}
\tag{8}
\]

for all entries of B_max with p non-silent. Having computed these entries, we can now
proceed to compute the entries for p silent by

\[
B_{\max}(U,p,q) = \max_{\substack{r \in M \\ r \text{ non-silent}}}
\left\{ A_{\max}(p,r) \cdot B_{\max}(U,r,q) \right\}.
\tag{9}
\]

Computing the B_max array this way reduces the time requirements to O(|P_t| · n^3).
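
The two-stage computation can be sketched as follows: first the entries with p non-silent, where the preceding silent run is empty, and then the entries with p silent, which reuse the first stage. The data layout, the precomputed `a_max` table and the name `b_max_arrays` are assumptions of this example.

```python
def b_max_arrays(n, silent, trans, emit, unary, a_max):
    """B[U][p][q] computed in two stages: non-silent p first, then silent p."""
    nts = {U for U, _ in unary}
    B = {U: [[0.0] * n for _ in range(n)] for U in nts}
    for U in nts:
        # Stage 1: p non-silent, so the preceding silent run is empty.
        for p in range(n):
            if silent[p]:
                continue
            # best rule U -> sigma combined with the emission of sigma in p
            emit_best = max(pr * emit[p].get(a, 0.0)
                            for (V, a), pr in unary.items() if V == U)
            for q in range(n):
                B[U][p][q] = emit_best * max(
                    (a * a_max[r][q] for r, a in trans[p].items()), default=0.0)
        # Stage 2: p silent, prepend the best silent run to a non-silent state r.
        for p in range(n):
            if silent[p]:
                for q in range(n):
                    B[U][p][q] = max(
                        (a_max[p][r] * B[U][r][q] for r in range(n) if not silent[r]),
                        default=0.0)
    return B

# Hypothetical model: 0 (silent), 1 (non-silent), 2 (silent)
silent = [True, False, True]
trans = [{1: 0.6, 2: 0.4}, {2: 1.0}, {}]
emit = [{}, {"a": 0.7, "b": 0.3}, {}]
unary = {("U", "a"): 0.5, ("U", "b"): 0.5}
a_max = [[1.0, 0.6, 0.4], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
B = b_max_arrays(3, silent, trans, emit, unary, a_max)
```

The maximization over rules and the maximization over transitions are independent of each other in the first stage, so they are taken separately and multiplied.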

We are now ready to specify how to determine the value of (6), i.e., how to compute
the entries of the C_max array. An entry C_max(U, p, q) holds the maximum probability of
a combination of a derivation from U in G and a partial run from p to q in M yielding
the same string. The following equation for computing an entry of C_max closely follows
Fig. 1(c).

\[
C_{\max}(U,p,q) = \max
\begin{cases}
P_G(U \to \varepsilon) \cdot A_{\max}(p,q) \\
B_{\max}(U,p,q) \\
\max \{ P_G(U \to XY) \cdot C_{\max}(X,p,r) \cdot C_{\max}(Y,r,q) \mid (U \to XY) \in G,\ r \in M \}
\end{cases}
\tag{10}
\]

The only exception is that there is no harm in considering the same combination of a
derivation and a partial run several times when working with maxima instead of sums.
Hence, there is no restriction on the type of the state (denoted by r) that we maximize
over in the general case. In an actual implementation one might want to retain the type
restriction to speed up the program, though.

Again, this is slightly beyond a shortest path problem. However, (10) gives us a
number of inequalities for each of the entries of C_max, similar to [?, Lemma 25.3];
C_max(U, p, q) must be at least as large as any of the terms on the right-hand side of
(10). Thus, we can use a technique very similar to the relaxation technique discussed
in [?, pp. 520-521]; if at any time C_max(U, p, q) < P_G(U → XY) · C_max(X, p, r) ·
C_max(Y, r, q) for any (U → XY) ∈ G and r ∈ M we can increase C_max(U, p, q)
to this value. This means that we could start by initializing each C_max(U, p, q) to
max{P_G(U → ε) · A_max(p, q), B_max(U, p, q)} and then keep updating C_max by
iterating over all possible relaxations until no more changes occur. This process will
eventually terminate as no entry can depend cyclically on itself, as discussed above.

The worst-case time requirements of this scheme are quite excessive, though. Thus,
the question of how to order the relaxations so as to compute C_max most efficiently still
remains. We propose an approach very similar to Dijkstra's algorithm for computing
single-source shortest paths in a graph, cf. [?, Sect. 25.2]. Assume that S is the set
of entries for which we have already determined the correct value, and that we have
performed all relaxations combining the correct values of these entries (and initialized
all entries using the A_max and B_max arrays as mentioned in the preceding paragraph).
Let C_max(U, p, q) be an entry of maximum value not in S. We claim that the current
value of C_max(U, p, q) must be correct, i.e., that it cannot be increased. In fact, no entry
not in S can have a correct value larger than C_max(U, p, q). The reason for this is that
any relaxation not combining two entries both in S (which we assumed have already
been performed) will involve an entry not in S, and thus with current value at most
C_max(U, p, q). As the other entry used in the relaxation can have a value of at most
1, no future relaxations can lead to values above C_max(U, p, q). Hence, we can insert
C_max(U, p, q) in S and perform all relaxations combining C_max(U, p, q) with an entry
from S. This idea is formalized in Algorithm 1.

/* Initialization */
PQ = ∅, S = ∅
for all U ∈ G, p, q ∈ M do
    PQ.insert((U, p, q); max{P_G(U → ε) · A_max(p, q), B_max(U, p, q)})
/* Main loop where one entry is fixed at a time */
while PQ is not empty do
    /* Fix the entry with highest probability not yet fixed */
    (U, p, q); x = PQ.deletemax
    S.insert((U, p, q); x)
    /* Combine this entry with all feasible, fixed entries */
    for all X, Y ∈ G with (X → UY) ∈ G and all r ∈ M with (Y, q, r); y ∈ S do
        PQ.increasekey((X, p, r); P_G(X → UY) · x · y)
    for all X, Y ∈ G with (X → YU) ∈ G and all r ∈ M with (Y, r, p); y ∈ S do
        PQ.increasekey((X, r, q); P_G(X → YU) · y · x)

Algorithm 1: The algorithm for computing an optimal parse of a hidden Markov model
M by a stochastic context-free grammar G.

So what are the time requirements of Algorithm 1? To some extent this of course
depends on the choice of data structures implementing the priority queue PQ and the
set S, but a key observation is that each possible relaxation, i.e., combination of two
particular entries, is only performed once, namely when the entry with the smaller value
is inserted in S. Hence, Algorithm 1 performs O(|P_n| · n^3) relaxations. One can observe
that for each relaxation we need to perform the operation increasekey on PQ, while all
other operations on PQ are performed at most O(|V| · n^2) times. Thus, increasekey is
the most critical operation, which is why we will assume that PQ is implemented by a
Fibonacci heap. This limits the time requirements for all operations on PQ to
O(|P_n| · n^3 + |V| · n^2 · log(|V| · n)).

For the set S we need to be able to insert an element efficiently, and to efficiently
iterate through all elements with a particular non-terminal and state. But having already
set aside time Ω(|P_n| · n^3) for the priority queue operations, the operations on S do
not need to be that efficient. As it turns out, it is actually sufficient to maintain S simply
as a three-dimensional boolean array indexed by (U, p, q). This makes insertion a
constant time operation. However, it does not allow for an efficient way to iterate over
all elements in S with a particular non-terminal and state, short of iterating over all
elements with that non-terminal and state and testing the membership of each individual
element. However, this turns out to be sufficiently efficient. In some situations,
especially in the beginning when there are only a few elements in S, we might test the
membership of numerous elements not in S. But each membership test can be associated
with a relaxation involving a particular pair of elements, namely the relaxation
that is performed if the test succeeds. Furthermore, for each relaxation we will only test
membership twice, once for each element that the relaxation combines. Hence, the total
time we spend iterating over elements in S is O(|P_n| · n^3). Thus, the time requirements
for Algorithm 1 are O(|P_n| · n^3 + |V| · n^2 · log(|V| · n)). Combined with the time
complexity of computing the A_max and B_max arrays, this leads to an overall time
complexity of O(|P| · n^3 + |V| · n^2 · log(|V| · n)) for determining the optimal parse of
a general hidden Markov model by a stochastic context-free grammar (having computed
A_max, B_max, and C_max it is easy to find the optimal parse by standard backtracking
techniques). This should be compared with the time requirements of O(|P| · |s|^3) for
finding the optimal parse of a string s by the CYK algorithm.
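
A compact way to experiment with this Dijkstra-like scheme is sketched below. It uses Python's binary heap `heapq`, which has no increasekey operation, so improved values are pushed as new heap entries and outdated ones are skipped when popped. This changes the asymptotic bounds compared to a Fibonacci heap and is a substitution made for this example, as are the names `c_max`, `a_max`, `b_max` and the dictionary layout. Each relaxation multiplies the two entry values by the probability of the production rule combining them.

```python
import heapq

def c_max(nts, n, eps, binary, a_max, b_max):
    """Fix entries (U, p, q) in order of decreasing value, relaxing as we go."""
    # Index the binary rules by the non-terminal in each right-hand side position.
    as_left, as_right = {}, {}
    for (X, L, R), pr in binary.items():
        as_left.setdefault(L, []).append((X, R, pr))   # X -> L R
        as_right.setdefault(R, []).append((X, L, pr))  # X -> L R
    value = {(U, p, q): max(eps.get(U, 0.0) * a_max[p][q], b_max[U][p][q])
             for U in nts for p in range(n) for q in range(n)}
    heap = [(-v, key) for key, v in value.items()]  # negate: heapq is a min-heap
    heapq.heapify(heap)
    fixed = {}

    def relax(key, candidate):
        if key not in fixed and candidate > value[key]:
            value[key] = candidate
            heapq.heappush(heap, (-candidate, key))  # stands in for increasekey

    while heap:
        neg, key = heapq.heappop(heap)
        if key in fixed:
            continue  # outdated duplicate of an entry that is already fixed
        U, p, q = key
        x = fixed[key] = -neg
        for X, Y, pr in as_left.get(U, []):    # rules X -> U Y
            for r in range(n):
                y = fixed.get((Y, q, r))
                if y is not None:
                    relax((X, p, r), pr * x * y)
        for X, Y, pr in as_right.get(U, []):   # rules X -> Y U
            for r in range(n):
                y = fixed.get((Y, r, p))
                if y is not None:
                    relax((X, r, q), pr * y * x)
    return fixed

# Hypothetical input: the values of b_max are made up for illustration
n = 3
zeros = lambda: [[0.0] * n for _ in range(n)]
a_max = [[1.0 if p == q else 0.0 for q in range(n)] for p in range(n)]
b_max = {"S": zeros(), "X": zeros()}
b_max["X"][0][1], b_max["X"][1][2], b_max["X"][0][2] = 0.5, 0.4, 0.05
binary = {("S", "X", "X"): 1.0, ("X", "X", "X"): 0.3}
C = c_max(["S", "X"], n, {}, binary, a_max, b_max)
```

In this example the entry ("X", 0, 2) starts at 0.05 and is raised to 0.3 · 0.5 · 0.4 = 0.06 by a relaxation before it is fixed.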

In the above description we did not use any assumptions about the structure of the
hidden Markov model M. A natural question to ask is thus how much can be gained
with respect to time requirements by restricting our attention to left-right models. That
it is not a lot should not be surprising, considering that we are already close to the
complexity of the CYK algorithm. However, we can observe that C_max(U, p, q) can
only depend on entries C_max(U', p', q') with p ≤ p' ≤ q' ≤ q (where the ordering is
with respect to the implicit partial ordering of states in a left-right model). Thus, we can
separate (10) into O(n^2) systems, one for each choice of p and q, that can be solved
one at a time in a predefined order. Hence, the priority queue only needs to hold at
most |V| elements at any time, reducing the time complexity for finding the optimal
parse to O(|P| · n^3 + |V| · n^2 · log |V|). More importantly, though, as we only need a
priority queue with at most |V| elements and |V| will usually be very small, it might be
feasible to replace the Fibonacci heap implementation with implementations that have
worse asymptotic complexities but smaller overhead. Thus, if we just implement PQ
with an array, scanning the entire array each time a deletemax operation is performed,
the complexity of parsing a left-right model only increases to O(|P| · n^3 + |V|^2 · n^2)
with the involved constants being very small.

[Figure 2: alignment of seq1, seq2 and the sequence of the maximal parse, each shown
with its secondary structure in bracket notation.]

Fig. 2. Alignment and predicted secondary structure of the two sequences, seq1 and
seq2, used to construct the trna2 model, and the sequence and secondary structure
of the maximal parse of trna2 aligned according to the states of trna2 emitting the
symbols. In the two marked positions the sequence of the maximal parse
does not match either of the two other sequences.



6 Results 

Algorithm 1 has been implemented as the program CStoRM (Comparison of Stochastic
(Random) Models) and we are currently working on adding computation of the
co-emission probability to the implementation. The implementation is available at
http://www.cse.ucsc.edu/~rlyngsoe/cstorm.tar.gz. As an illustrating
experiment we have used the program to parse the trna2 profile hidden Markov
model that is part of the test suite for the SAM software distribution available at
http://www.cse.ucsc.edu/research/compbio/sam.html. This model
is built from the alignment of seq1 and seq2 shown in Fig. 2, with the symbol emission
probabilities in each position being a little less than 0.1 for symbols not present in that
position in the alignment and the rest of the probability distributed evenly among the
symbols present in that position in the alignment. Each match-state has a probability
of at least 0.85 (and for positions without gaps in the alignment more than 0.97)
for choosing the transition to the next match-state. The grammar used is the stochastic
context-free grammar for general RNA secondary structure presented in [?]. This
grammar was also used for predicting the secondary structure of seq1 and seq2 shown
in Fig. 2.

As is evident from Fig. 2, the maximal parse does a poor job at finding the common
structural elements of the two sequences. This might in part be explained by the fact
that only about half the base pairs of the structure of each of the sequences are shared
with the structure of the other sequence. But not even any of the shared base pairs are
present in the structure found by the maximal parse. Instead, one can observe that the
structure of the maximal parse is much more dense than the structures of seq1 and seq2,
with only a few bases not being part of one of the two uninterrupted helices constituting
the structure. This is probably an indication of the main problem of using the maximal
parse to predict the secondary structure. In each position we can choose a symbol so as
to construct a sequence with a structure of very high probability, i.e., the maximal parse
seems to be a question of highly probable structures finding matching sequences instead
of what we are really looking for, highly probable sequences finding a matching
structure. This is further supported by the two positions where the maximal parse disagrees
with seq1 and seq2, especially as seq1 and seq2 agree in these positions: the sequence
obtained by changing the symbols in these two positions to the symbols shared by seq1
and seq2 would be roughly 80 times as probable in the hidden Markov model.

Another problem is that the maximum is not very good for discriminating states
that exhibit complementarity from states that do not exhibit complementarity. For example,
a state that has probabilities 0.5 for emitting either a C or a G gets a lower probability
if paired with a state identical to itself than if paired with a state that has probability
0.51 for emitting a C and 0.49 for emitting an A. However, having the framework of
Algorithm 1, it is easy to modify the details to accommodate a scoring of combinations
of pairs of states and derivations introducing base pairs that captures complementarity
better. Furthermore, the co-emission probability will indeed capture that the two C/G-
emitting states exhibit better complementarity than the C/G-C/A pair. Thus, a better idea
of a common structure might be obtained by looking at the probability that two states
emit symbols that are base paired for all pairs of non-silent states, similar to [?].
Indeed, as the dependencies of the energy rules commonly used for RNA secondary
structure prediction can be captured by a context-free grammar, one can also combine
the computation of the co-emission probability as discussed in this paper with the
computation of the equilibrium partition function presented in [?] to obtain probabilities
for base pairing of positions including both the randomness of base pairing captured by
the partition functions as well as the variability of a family of sequences captured by a
(profile) hidden Markov model.
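
The C/G example can be checked numerically. The sketch below compares the maximum and the sum of the probabilities of emitting a complementary pair of symbols, under the simplifying assumption that only Watson-Crick pairs count and that all base pair rules have the same probability (so it is left out). The names `PAIRS` and `pair_scores` are hypothetical.

```python
# Assumption for this example: only Watson-Crick pairs are complementary.
PAIRS = {("C", "G"), ("G", "C"), ("A", "U"), ("U", "A")}

def pair_scores(e1, e2):
    """Return (max, sum) of the probabilities of emitting a complementary pair."""
    probs = [e1[x] * e2[y] for x in e1 for y in e2 if (x, y) in PAIRS]
    return max(probs, default=0.0), sum(probs)

cg = {"C": 0.5, "G": 0.5}    # emits C or G with equal probability
ca = {"C": 0.51, "A": 0.49}  # emits C or A

max_self, sum_self = pair_scores(cg, cg)    # best single pair vs. all pairs
max_other, sum_other = pair_scores(cg, ca)
```

The maximum prefers pairing the C/G state with the C/A state (0.5 · 0.51 against 0.5 · 0.5), while the sum ranks the pairing of the two C/G states higher, in line with the argument above.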



7 Discussion 



In this paper we have considered the problem of comparing a hidden Markov model 
with a stochastic context-free grammar. The methods presented can be viewed as natu- 
ral generalizations of methods for analyzing strings by means of stochastic context-free 
grammars, or of the idea of comparing two hidden Markov models in A natural 
question is thus whether we can further extend the results to comparing two stochastic 
context-free grammars. If we could determine the co-emission probability of — or just 
the maximal joint probability of a pair of parses in — two stochastic context-free gram- 
mars, we could also determine whether the languages of two context-free grammars 
are disjoint, simply by assigning a uniform probability distribution to the derivations of 
each of the variables and asking whether the computed probability is zero. However, 
a well-known result in formal language theory states that it is undecidable whether 
the languages of two context-free grammars are disjoint [?, Theorem 11.8.1]. Hence, 
we cannot generalize the methods presented in this paper to methods comparing two 
stochastic context-free grammars with a precision that allows us to determine whether 
the true probability is zero or not. 

In [?] we use the co-emission probability to compute the L2-distance between the 
probability distributions of two hidden Markov models. Having demonstrated how to 
compute the co-emission probability between a hidden Markov model and a stochastic 
context-free grammar, the only thing further required to compute the L2-distance is 
the co-emission probability of the grammar with itself. As stated above, the problem 
of computing the co-emission probability between two stochastic context-free grammars 
is undecidable, but one could hope that computing the co-emission probability 
of a stochastic context-free grammar with itself would be easier. However, given two 
stochastic context-free grammars G1 and G2 we can construct an aggregate grammar G' 
where the start symbols of G1 and G2 are chosen with equal probability one half. 
It is easy to see that the co-emission probability of G' with itself is one quarter of the 
sum of the co-emission probabilities of G1 and G2 with themselves and twice the 
co-emission probability between G1 and G2. Hence, computing the co-emission probability of a 
stochastic context-free grammar with itself, or the L2- or Hellinger-distances between 
the probability distributions of a context-free grammar and a hidden Markov model, is 
as hard as computing the co-emission probability between two stochastic context-free 
grammars. 
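
The identity behind the aggregate-grammar argument can be written out explicitly. As a sketch, write P_G(s) for the probability that grammar G derives the string s and C(G, G') for the co-emission probability of two grammars; these two symbols are notation assumed here for illustration, not taken from the text above.

```latex
\begin{align*}
P_{G'}(s) &= \tfrac12 P_{G_1}(s) + \tfrac12 P_{G_2}(s), \\
C(G', G') &= \sum_s P_{G'}(s)^2
           = \tfrac14 \bigl( C(G_1, G_1) + C(G_2, G_2) + 2\, C(G_1, G_2) \bigr), \\
C(G_1, G_2) &= 2\, C(G', G') - \tfrac12 \bigl( C(G_1, G_1) + C(G_2, G_2) \bigr).
\end{align*}
```

Under this reading, three evaluations of the co-emission probability of a grammar with itself would suffice to recover the co-emission probability between G1 and G2, which is the reduction used in the argument.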

In this paper we have presented the co-emission probability as a measure for com- 
paring two stochastic models. However, the co-emission probability has at least two 
other interesting uses. First, it allows us to use one model as a prior for training the 
other model, e.g., using the distribution over sequences of a hidden Markov model 
as our prior belief about the distribution over sequences for a stochastic context-free 
grammar we want to construct. Secondly, it allows us to compute the probability that 
two stochastic models have independently generated a sequence s given the two models 
generate the same sequence. I.e. we can combine two models under the assumption of 
independence. 

Abstract. Assessing statistical significance of over-representation of exceptional 
words is becoming an important task in computational biology. We show on two 
problems how large deviation methodology applies. First, when some oligomer 
H occurs more often than expected, i.e. may be overrepresented, large deviations 
allow for a very efficient computation of the so-called p-value. The second 
problem we address is the possible changes in the oligomers distribution induced 
by the over-representation of some pattern. Discarding this noise allows for the 
detection of weaker signals. Related algorithmic and complexity issues are dis- 
cussed and compared to previous results. The approach is illustrated with three 
typical examples of applications on biological data. 



1 Introduction 

Putative DNA recognition sites can be defined in terms of an idealized sequence that 
represents the bases most often present at each position. Conservation of only very 
short consensus sequences is a typical feature of regulatory sites (such as promoters) 
in both prokaryotic and eukaryotic genomes. Structural genes are often organized into 
clusters that include genes coding for proteins whose functions are related. Data from 
the Arabidopsis genome project suggest that more than 5% of the genes of this plant 
encode transcription factors. The necessity for the use of genomic analytical approaches 
becomes clear when it is considered that less than 10% of these factors have been ge- 
netically characterized. Transcription-factor genes comprise a substantial fraction of 
all eukaryotic genomes, and the majority can be grouped into a handful of different, 
often large, gene families according to the type of DNA-binding domain that they en- 
code. Functional redundancy is not unusual within these families; therefore the proper 
characterization of particular transcription-factor genes often requires their study in the 
context of a whole family. The scope of genomic studies in this area is to find cis- 
acting regulatory elements from a set of co-regulated DNA sequences (e.g. promoters). 
The basic assumption is that a cluster of co-regulated genes is regulated by the same 
transcription factors and the genes of a given cluster share common regulatory motifs. 

On the other hand, the importance of whole-genome studies is highlighted by the 
fact that approximately 50% of the Arabidopsis genes have no proposed function, and 
that the Arabidopsis genome, like those of Saccharomyces cerevisiae, Caenorhabditis 
elegans and Drosophila, contains extensive duplications. These include many tandem 
gene duplications as well as large-scale duplications on different chromosomes. The 
prevalence of gene duplication in Arabidopsis implies that redundancy is a problem 
that will have to be dealt with in the functional analysis of genes. In order to find long 
approximate repeats, some approaches consist in searching for short exact motifs that 
appear in the sequences. 



In both problems, there is a need for identification of over-represented motifs in the 
considered sequences. Research is very active in this area [?]. In these works, one 
searches for exceptional patterns in nucleotide sequences, using various tools to 
assess the significance of such rare events. 



Large deviation theory is a mathematical area that deals with rare events; to our 
knowledge, it has not been used in computational biology, although the extremal statistics on 
alignments [?] can be viewed as large deviation results. Nevertheless, our recent results 
in [?], which extend preliminary results in [?], show that it may be a very powerful method to 
assess the statistical significance of very rare events. 



The first problem we address is the following. One considers a candidate, e.g. a 
word that occurs more often than expected. One needs to quantify this difference be- 
tween the observation and the expectation. Among the classical statistical tools, the 
so-called p-values are much more precise than the Z-scores (or the x-scores). The drawback 
is that their computation is considered much harder. Large deviations provide a 
very efficient way to compute them in some cases. 



As a second problem, we consider some consequences of the over-representation of 
a word on a sequence distribution. In particular, it has been observed that, whenever a 
word is overrepresented, its subwords or the words that contain it, look overrepresented. 
Such words are called artifacts below [?]. It is a desirable goal to choose the best 
element in the set composed of a word and its artifacts. It is also important to discard 
automatically the “noise” created by the artifacts, in order to detect other words that 
are potentially overrepresented. An important example is the noise introduced by the 
Alu sequences. Another one is the χ-sequence GNTGGTGG in H. influenzae [?]. We 
provide some mathematical results and the algorithmic consequences. 

The efficiency of this approach comes from the existence of explicit formulae for 
the (conditioned) distribution. Large deviations allow for a very fast computation. Moreover, 
owing to the simplicity of the result (if not of the proof), the implementation is 
easy and provides numerically stable and guaranteed computations. The reason is that 
the problem reduces to the numerical computation of the root of a polynomial. Other 
approaches need numerical computations of exp and log functions, and their implementation 
is delicate and machine dependent. Hence, the large deviation approach occasionally 
corrects commonly used approximations. Interestingly, it applies at the same cost 
to self-overlapping patterns. This is a great improvement on the approaches based on 
binomial or related formulae [?], where autocorrelation effects are neglected. Still, 
our computation, which takes the correlations into account, is much faster and more precise than 
computing the approximation. The approach is valid for various counting models. For the 
sake of clarity, we present it for the most commonly used one, the overlapping model [?]. 

In Section 2 we present and discuss the statistical criteria commonly used for evaluating 
over-representation of patterns in sequences. Section 3 is devoted to our mathematical 
results. In Sections 4, 5 and 6 we validate our approach by a comparison with 
published results derived by other methods that are computationally more expensive. 
Finally, in Section 7 we discuss possible improvements and present further work. 



2 Statistical Tools in Computational Biology 

In the present section, we present basic useful definitions for statistical criteria and we 
briefly discuss their limits, i.e. the validity domains and the computational efficiency. 
Below, we denote by O(H) the number of observations of a given pattern H in a given 
sequence. Depending on the application, it may be either the number of occurrences 
[?] or the number of sequences where it appears [?]. 



Z-Scores. Many definitions of this parameter can be found in the literature. Other 
names can be used: see for instance the so-called contrast used in [?]. A common 
feature is the comparison of the observation with the expectation, using the variance as 
a normalization factor. A rather general definition is 

    Z(H) = \frac{O(H) - E(H)}{\sqrt{V(H)}}        (1)

where H is a given pattern or word, O(H) is the observed number of occurrences, 
E(H) the expectation and V(H) the variance. Many recent works allow for a fast 
computation of E and V, hence Z. Relevant approximations are discussed in [?], 
notably the Poisson approximation V = E. Nevertheless, although Z-scores are a very efficient 
filter to detect potential candidates, they are not precise enough. Notably, this parameter 
is not stable enough for very exceptional words, e.g. when the expectation is much 
smaller than 1. This will be detailed in Section 4. Moreover, it is relevant only for large 
sequences, and does not adapt easily to the search in several small sequences. 



p-Values. For each word that occurs r times in a sequence or in a set of N (related) 
sequences, one computes the probability that this event occurs just “by chance”: 

    pval(H) = P(O(H) \geq r) .        (2)

When the expectation of a given word is much smaller than 1, a single occurrence is a 
rare event. In this case, the p-value is defined as P(O(H) ≥ r knowing that O(H) ≥ 1), 
i.e. 

    pval(H) = \frac{P(O(H) \geq r)}{P(O(H) \geq 1)} .        (3)



The computation is performed in two steps. First, the probability that H occurs in a 
given sequence, i.e. P(O(H) ≥ 1), is known. An exact formula is provided in [?] 
and used in [?]. An approximate formula is often used, for instance in the software 
RSA-tools (http://www.ucmb.ulb.ac.be/bioinformatics/rsa-tools/) or 
in [?]. Then two different cases occur. 

Set of small sequences. The p-value in [?] is the probability that r sequences out of 
N contain H; when they are independent, it is given by a binomial formula: 

    pval(H) = \binom{N}{r} P(O(H) \geq 1)^{r} \, (1 - P(O(H) \geq 1))^{N-r} .        (4)

Large sequences. One needs to compute P(O(H) ≥ r), through the exact formulae 
in [?] or an approximation. 

This p-value is evaluated in [?] through a significance coefficient. Given a motif H, the 
significance coefficient is defined as 

    Sig = -\log_{10}[P(O(H) \geq r) \times D],

which takes into account the number of different oligonucleotides D. The number of 
distinct oligonucleotides depends on whether one counts on a single strand or on both 
strands. 
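
As an illustration of the binomial case and of the significance coefficient, here is a minimal sketch. It assumes that the per-sequence probability q = P(O(H) ≥ 1) is already available; the function names and the numbers in the usage lines are hypothetical and are not taken from the works cited above.

```python
from math import comb, log10


def binomial_point(n_sequences: int, r: int, q: float) -> float:
    """Probability that exactly r of n_sequences independent sequences
    contain the word, where q = P(O(H) >= 1) for a single sequence."""
    return comb(n_sequences, r) * q**r * (1.0 - q) ** (n_sequences - r)


def binomial_tail(n_sequences: int, r: int, q: float) -> float:
    """Probability that r or more of the sequences contain the word."""
    return sum(binomial_point(n_sequences, j, q) for j in range(r, n_sequences + 1))


def significance(p_value: float, n_oligos: int) -> float:
    """Sig = -log10(p_value * D), D being the number of distinct oligonucleotides."""
    return -log10(p_value * n_oligos)


if __name__ == "__main__":
    q = 0.05                       # hypothetical per-sequence probability
    p = binomial_tail(20, 5, q)    # hypothetical: H found in 5 of 20 sequences
    print(p, significance(p, 4**6))  # 4**6 hexanucleotides on a single strand
```

Summing the tail rather than using a single term is a design choice of this sketch: it gives the probability of "r or more" sequences, which is the event the p-value is meant to measure.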

3 Main Results 

3.1 Basic Notations 

The model of random text that we work with is the Bernoulli model: one assumes 
the text to be randomly generated by a memoryless source. Each letter s of the alphabet 
has a given probability p_s to be generated at any step. Generally, the p_s are not equal. 

Definition 1. Given a pattern H of length m on the alphabet S and a Bernoulli distribution 
on the letters of S, the probability of H is defined as 

    P(H) = \prod_{i=1}^{m} p_{h_i}

where h_i denotes the i-th character of H. By convention, the empty string ε has probability 1. 

Finding a pattern in a random text is, in some sense, correlated to the previous occurrences 
of the same or other patterns [?]. For example, the probability of finding 
H1 = ATT knowing that one has just found H2 = TAT is, intuitively, rather high 
since a T right after H2 is enough to give H1. Correlation polynomials and correlation 
functions give a way to formalize this intuition. 

Definition 2. The correlation set of two patterns Hi and Hj is the set of words w which 
satisfy: there exists a non-empty suffix v of Hi such that vw = Hj. It is denoted A_{ij}. If 
Hi = Hj, then the correlation set is called the autocorrelation set of Hi. 

Thus for example, the correlation set of H1 = ATT and H2 = TAT is A_{12} = {AT}; the 
autocorrelation set of H1 is {ε}, while the autocorrelation set of H2 is {ε, AT}. The empty 
string always belongs to the autocorrelation set of any pattern. 

Definition 3. The correlation polynomial of two patterns Hi and Hj of length m_i and 
m_j is defined as: 

    A_{ij}(z) = \sum_{w \in A_{ij}} P(w) z^{|w|}

where |w| denotes the length of word w. If Hi = Hj, then this polynomial is called the 
autocorrelation polynomial of Hi. The correlation function is: 

    D_{ij}(z) = (1 - z) A_{ij}(z) + P(H_j) z^{m_j} .

When Hi = Hj, the correlation function is written D_i. 
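
The definitions of the correlation set, correlation polynomial and correlation function can be sketched in a few lines. The code assumes the Bernoulli model, the reading of the correlation set as "suffix of Hi followed by w gives Hj", and the reading D_ij(z) = (1 - z) A_ij(z) + P(Hj) z^{m_j}; the function names and the uniform letter distribution are hypothetical choices made for the illustration.

```python
from math import prod


def correlation_set(h_i: str, h_j: str) -> set:
    """All words w such that v + w == h_j for some non-empty suffix v of h_i."""
    words = set()
    for length in range(1, min(len(h_i), len(h_j)) + 1):
        if h_j.startswith(h_i[-length:]):
            words.add(h_j[length:])
    return words


def word_probability(word: str, probs: dict) -> float:
    """Bernoulli probability of a word; the empty word has probability 1."""
    return prod(probs[c] for c in word)


def correlation_polynomial(h_i: str, h_j: str, probs: dict, z: float) -> float:
    """A_ij(z): sum of P(w) z^|w| over the correlation set."""
    return sum(word_probability(w, probs) * z ** len(w)
               for w in correlation_set(h_i, h_j))


def correlation_function(h_i: str, h_j: str, probs: dict, z: float) -> float:
    """D_ij(z) = (1 - z) A_ij(z) + P(h_j) z^len(h_j)."""
    return ((1.0 - z) * correlation_polynomial(h_i, h_j, probs, z)
            + word_probability(h_j, probs) * z ** len(h_j))


if __name__ == "__main__":
    uniform = {c: 0.25 for c in "ACGT"}   # hypothetical letter probabilities
    print(correlation_set("ATT", "TAT"))
    print(correlation_set("TAT", "TAT"))
    print(correlation_function("TAT", "TAT", uniform, 0.5))
```

The first two calls reproduce the sets of the worked example with H1 = ATT and H2 = TAT.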

The most common counting model is the overlapping model, in which overlapping occurrences 
of patterns are taken into account. For example, consider two oligonucleotides 
H1 = ATT, H2 = TAT and a sequence TTATTATATATT. This sequence contains 
2 occurrences of H1 and 4 occurrences of H2, as shown below: 

    T T A T T A T A T A T T
        A T T                       H1
                      A T T         H1
      T A T                         H2
            T A T                   H2
                T A T               H2
                    T A T           H2
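
The overlapping counting model of this example can be sketched as follows; the function name is hypothetical. The search restarts one position after each match, so occurrences that share letters are all counted.

```python
def overlapping_positions(text: str, pattern: str) -> list:
    """1-based start positions of all (possibly overlapping) occurrences."""
    positions = []
    start = text.find(pattern)
    while start != -1:
        positions.append(start + 1)
        # Restart one letter further, not after the match, to keep overlaps.
        start = text.find(pattern, start + 1)
    return positions


if __name__ == "__main__":
    sequence = "TTATTATATATT"
    print(overlapping_positions(sequence, "ATT"))  # occurrences of H1
    print(overlapping_positions(sequence, "TAT"))  # occurrences of H2
```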



It turns out [?] that our main results rely on the computation of the (real) roots of a 
polynomial equation: 

Definition 4. Let a be a real number such that a > P(H1). Let (E_a) be the fundamental 
equation: 

    D_1(z)^2 - (1 + (a - 1) z) D_1(z) - a z (1 - z) D_1'(z) = 0 .        (5)

Let z_a be the largest real positive solution of Equation (E_a) that satisfies 0 < z_a < 1. 
The number z_a is called the fundamental root of (E_a). 
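
A minimal numerical sketch of Definition 4 follows. It assumes the Bernoulli model and the reading D(z) = (1 - z) A(z) + P(H) z^m of the correlation function; the scan-then-bisect strategy, the grid size and all function names are choices made for this illustration, not the authors' Maple implementation. A coarse grid could in principle miss two roots lying very close together.

```python
from math import prod


def d_coefficients(pattern: str, probs: dict) -> list:
    """Coefficients (lowest degree first) of D(z) = (1 - z) A(z) + P(H) z^m."""
    m = len(pattern)
    a = [0.0] * (m + 1)                  # A(z) has degree at most m - 1
    for length in range(1, m + 1):
        if pattern.startswith(pattern[-length:]):
            w = pattern[length:]         # a word of the autocorrelation set
            a[len(w)] = prod(probs[c] for c in w)
    d = [a[0]] + [a[k] - a[k - 1] for k in range(1, m + 1)]
    d[m] += prod(probs[c] for c in pattern)
    return d


def evaluate(coeffs: list, z: float) -> float:
    """Horner evaluation of a polynomial given lowest degree first."""
    value = 0.0
    for c in reversed(coeffs):
        value = value * z + c
    return value


def fundamental_equation(pattern: str, probs: dict, a: float, z: float) -> float:
    """Left-hand side of equation (5)."""
    d = d_coefficients(pattern, probs)
    d_prime = [k * c for k, c in enumerate(d)][1:]
    dz, dpz = evaluate(d, z), evaluate(d_prime, z)
    return dz * dz - (1.0 + (a - 1.0) * z) * dz - a * z * (1.0 - z) * dpz


def fundamental_root(pattern: str, probs: dict, a: float,
                     grid: int = 10_000, tol: float = 1e-14):
    """Scan downward from z = 1 for the first sign change, then bisect."""
    if a <= prod(probs[c] for c in pattern):
        raise ValueError("the definition requires a > P(H)")
    hi = 1.0                             # left-hand side is negative at z = 1
    for i in range(1, grid):
        lo = 1.0 - i / grid
        if fundamental_equation(pattern, probs, a, lo) > 0.0:
            while hi - lo > tol:
                mid = 0.5 * (lo + hi)
                if fundamental_equation(pattern, probs, a, mid) > 0.0:
                    lo = mid
                else:
                    hi = mid
            return 0.5 * (lo + hi)
        hi = lo
    return None                          # no sign change found on the grid


if __name__ == "__main__":
    uniform = {c: 0.25 for c in "ACGT"}  # hypothetical letter probabilities
    print(fundamental_root("TAT", uniform, 0.05))
```

At z = 1 the left-hand side equals P(H)(P(H) - a), which is negative whenever a > P(H); this is why the scan starts there and looks for the first positive value.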



3.2 p-Value for a Single Pattern 

The main result of this section is the theorem below, proven in [?], which provides the 
probability for the observed number of occurrences to be much greater than the expectation. 

Theorem 1. Let H1 be a given pattern, and k be its observed number of occurrences 
in a random sequence of length n. Denote a = k/n and assume that a > P(H1). Then: 

pval(Hi) = Prob(0{Hi) > fc) « — ( 5 ) 

2aas/n 

where 

    I(a) = a \ln\left( \frac{D_1(z_a)}{D_1(z_a) + z_a - 1} \right) + \ln z_a ,        (7)

    \sigma_a^2 = a(a - 1) - a^2 z_a \left( \frac{2 D_1'(z_a)}{D_1(z_a)}
                 - \frac{(1 - z_a) D_1''(z_a)}{D_1(z_a) + (1 - z_a) D_1'(z_a)} \right) ,        (8)

    \delta_a = \log\left[ \frac{P(H_1) z_a^{m-1}}{D_1(z_a) + (1 - z_a) D_1'(z_a)} \right] ,        (9)

and z_a is the fundamental root of (E_a). I(a) is called the rate function. Additionally: 

    Prob(O(H_1) = k) \approx \frac{1}{\sigma_a \sqrt{2 \pi n}} e^{-n I(a) + \delta_a} .        (10)

Remark 1. When a = P(H1), the number of H1-occurrences is equal to its expected 
value. The conditional variance σ_a^2 in (8) becomes σ^2 = P(H1)(2A_1(1) - 1 + (1 - 2m)P(H1)) 
(where m denotes the length of H1), i.e. the unconditional variance computed 
by various authors [?]. 

Remark 2. The two probabilities Prob(O(H1) ≥ k) and Prob(O(H1) = k) appear to 
be very similar in magnitude. 

These results turn out to be very precise. It is shown in 01 that the relative error is 
0(l/n) ; that is to say, the neglected term is upper bounded by e which is 

exponentially small. A numerical comparison with the exact computation implemented, 
for the Bernoulli model, in III is given in Section E] It appears that, when the random 
sequence is large, the formulae above provide an attractive alternative to the exact com- 
putation. Moreover, they also hold for the Markov model iB- 
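
To make the role of the rate function concrete, here is a minimal sketch restricted to a pattern that does not overlap itself, so that A(z) = 1 and D(z) = 1 - z + P(H) z^m. It assumes the reading I(a) = a ln(D(z_a) / (D(z_a) + z_a - 1)) + ln z_a of the rate function. Under that restriction the fundamental equation factors as P(H) z^m times the function g below, which is an observation of this sketch rather than a statement from the text; the function name and the sample values are hypothetical.

```python
from math import log


def rate_function(a: float, p_h: float, m: int) -> float:
    """I(a) for a non-self-overlapping pattern of length m and probability p_h."""
    if not p_h < a < 1.0 / m:
        raise ValueError("this sketch needs P(H) < a < 1/m")

    def g(z: float) -> float:
        # Fundamental equation divided by P(H) z^m when A(z) = 1.
        return (1.0 - z) * (1.0 - a * m) + p_h * z**m - a * z

    lo, hi = 0.0, 1.0          # g(0) = 1 - a m > 0 and g(1) = P(H) - a < 0
    for _ in range(200):       # plain bisection; g is convex on [0, 1]
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    z_a = 0.5 * (lo + hi)
    d = (1.0 - z_a) + p_h * z_a**m
    # Here D(z) + z - 1 = P(H) z^m, which avoids a cancellation in the ratio.
    return a * log(d / (p_h * z_a**m)) + log(z_a)


if __name__ == "__main__":
    p_h = 0.25**4              # hypothetical 4-letter word, uniform letters
    for a in (0.01, 0.02):
        print(a, rate_function(a, p_h, 4))
```

The leading exponential factor of the probabilities in Theorem 1 is then exp(-n I(a)) for a sequence of length n.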

Another approach [l^ is the approximation of the word counting distribution by 
a compound Poisson distribution with the same mean, e.g. nP(H) in the overlapping 
counting model. For the compound Poisson distribution defined in [GSI, the variance is 
not asymptotically equal to the variance of the process. As a consequence, the validity 
domain is restricted to the domain where the difference is small. More important, the 
normalization factor is improper. This implies an error on the rate function I (a), and 
the neglected term in the validity domain, is not exponentially small [Gl. Numerical 
evaluation of the compound Poisson distribution can be found in 1 1121 . 

3.3 Conditioning by an Overrepresented Word 

In this subsection, we assume that a pattern H1 has been detected as an overrepresented 
word and we provide mathematical results to investigate the changes induced on 
the sequence distribution. Intuitively, the artifacts of an overrepresented word should 
look overrepresented. For example, if H1 = AATAAA, any word H2 = ATAAAN 
is an artifact. A rough approximation of its expected value is O(H1) × p_N, where p_N 
is the probability of the letter N. As O(H1) >> E(H1), this is much greater than the 
unconditioned expectation E(H1) × p_N. 
The theorem below, proven in [?], establishes the precise formulae: 

Theorem 2. Given two patterns H1 and H2, assume the number of H1-occurrences, 
O(H1), is known and equal to k, with a = k/n > P(H1). Then, the conditional expectation 
of O(H2) is: 

    E(O(H_2) \mid O(H_1) = k) \sim n \alpha        (11)

where α is a function of the autocorrelation functions, the probabilities and a: 

    \alpha = a \, \frac{D_{1,2}(z_a) \, D_{2,1}(z_a)}{D_1(z_a) \, (D_1(z_a) + z_a - 1)}        (12)

and z_a is the fundamental root of Equation (E_a). Moreover, the variance is a linear 
function of n. 

Remark 3. In the central region, i.e. k = nP(H1), substitutions a = P(H1) and z_a = 
1 in (12) yield α = P(H2), if H1 and H2 do not overlap. 
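
Remark 3 can be checked by direct substitution. The sketch below assumes the reading D_{i,j}(z) = (1 - z) A_{i,j}(z) + P(H_j) z^{m_j} of the correlation function and the reading α = a D_{1,2}(z_a) D_{2,1}(z_a) / (D_1(z_a) (D_1(z_a) + z_a - 1)) of equation (12).

```latex
\begin{align*}
D_{i,j}(1) &= P(H_j)
  \quad\text{since the factor } (1 - z) \text{ vanishes at } z = 1, \\
\alpha\big|_{a = P(H_1),\; z_a = 1}
  &= P(H_1)\,\frac{P(H_2)\,P(H_1)}{P(H_1)\,\bigl(P(H_1) + 1 - 1\bigr)}
   = P(H_2).
\end{align*}
```

So, in the central region, conditioning on the count of H1 leaves the expected density of H2 at its unconditioned value P(H2).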

Once a dominating signal has been detected, one looks for a weaker signal by a 
comparison of the number of observed occurrences of patterns with their conditional 
expectations. This procedure automatically eliminates artifacts. An example is provided 
in Section 6. It also allows for a choice of the best candidate between a word and its 
artifacts. 

Computational Complexity. Another approach is used in Regexpcount [?]. Although 
our formal proof of Theorem 2 relies on similar mathematical tools, our explicit formulae 
allow for skipping the expensive intermediate computations (bivariate generating 
functions, recurrences, ...), and hence provide a much faster algorithm. 

4 Tandem Repeats in B. Subtilis and A. Thaliana 

In [?], the authors search for localized repeats with a statistical filter. The software Excep 
relies on a simple basic idea: long approximate repeats are likely to contain multiple exact 
occurrences of shorter words. DNA sequences are divided into overlapping fragments 
of size n. This size n is a parameter of the algorithm chosen for each run. Typically, 
n ranges from 250 to 5000. In each window, the p-value is computed for any pattern 
that occurs more than once. As the total number of occurrences, r, remains relatively 
small (typically 3 to 5), exact computation through generating functions is (theoretically) 
possible. Nevertheless, this approach, chosen by the authors, is computationally 
expensive: it typically requires r repeated multiplications of polynomials of degree n. This gives 
a time complexity O(n log n log r) if a Fast Fourier Transform is used, and numerical 
stability is rather delicate. Large deviation computation for rare events reduces to the 
numerical computation of real roots of a polynomial equation, namely Equation (5). 
Hence, it is easier to program, faster and much more stable numerically. This was efficiently 
implemented in Maple and compared with the results published in [?]; Table 1 
gives the results. The results in [?] are given for one 2008 nucleotides long fragment 



Table 1. Measures on the 7 Oligonucleotides Considered in [?]. 

Oligomer     Obs.   p-val. (large dev.)   p-val. [?]     Z-sc.
AATTGGCGG    2      8.059×10^?            8.343×10^?     48.71
TTTGTACCA    3      4.350×10^?            4.611×10^?     22.96
ACGGTTCAC    3      2.265×10^?            1.458×10^?     55.49
AAGACGGTT    3      2.186×10^?            2.780×10^?     48.95
ACGACGCTT    4      1.604×10^?            0.982×10^?     74.01
ACGCTTGG     4      5.374×10^?            4.391×10^?     84.93
GAGAAGACG    5      0.687×10^?            1.180×10^?     151.10

in A. thaliana where 5 approximate tandem repeats of a 40-uple were found. For all 
patterns, the occurrence probability nP(H) ranges between 10 and 10“^. For each 
oligonucleotide, the first value is the number of occurrences in the window and the second 
one is the p-value computed by our large deviation formulae, where the correcting 
term δ_a has been neglected. The third one is the p-value computed in [?] with a generating 
function method and the last one is the Z-score. We notice that, for any pattern, the 
p-values computed with the two different methods are of the same order of magnitude (this is 
illustrated in Figure 1, where the logs of p-values are plotted). However, they can differ 
by up to a factor 1.72. This is due to the approximation done in our calculations. When a 
increases, the difference between z_a and 1 also increases, as does the contribution of e^{δ_a}. 
Nevertheless, it is worth noticing that the p-value order is almost the same. One inversion 
occurs between patterns ACGGTTCAC and AAGACGGTT, which have similar p-values. 
On the other hand, the last column of the table confirms that the Z-score is not adequate 
for very rare events. Patterns AAGACGGTT and AATTGGCGG have the same Z-score 48, 
while p-values have a ratio 100. For patterns ACGACGCTT and ACGCTTGG, the two 
parameters define a different order. The same inversion appears between AATTGGCGG 
and TTTGTACCA. 

Fig. 1. Graphical comparison of -logs of p-values of Table 1. Abscissae refer to the 
ordering of motifs in the table. Circles denote the large deviations formula and squares the 
results of [?]. 



5 Oligonucleotide Frequencies from Yeast Upstream Regulatory 
Regions 

In [?], van Helden et al. study the frequencies of oligonucleotides extracted from regulatory 
sites from the upstream regions of yeast genes. Statistical significance of the 
oligonucleotides occurring in the 800 bp upstream sequences of regulatory regions 
is assessed by evaluating the probability of observing r or more occurrences of the 
oligonucleotide in the regulatory sequence, using the binomial formula. In [?], the 
probabilities are not computed in the Bernoulli model. For a given oligonucleotide, 
the authors count the number f of its occurrences in the non-coding sequences of yeast. 
Then, an approximate formula for P(O(H) ≥ 1) is given and the p-value follows through a 
computation of the binomial formula (4). It is observed in [?] that these binomial statistics 
prove to be appropriate, except for self-overlapping patterns such as AAAAAA, ATATAT, 
ATGATG. As a matter of fact, auto-correlation does not affect the expected occurrence 
number, but increases the variance [?]. In other words, the probability to observe either 
very high or very low occurrence values is increased for auto-correlated patterns. 

Table 2 compares the results of several methods to compute the significance coefficient 
defined above. Figure 2 presents a graphical view of this comparison. The sequence 
considered is the 800 bp upstream region of ORF YGR022C of Saccharomyces 
cerevisiae chromosome VII. By comparing BF2, LDo and GF (this last one representing 
the “more exact” result), we see that the large deviations values are very close to the GF 
ones (while their computation is much faster). The table also confirms that the overlaps 
must be taken into account when computing the significance coefficients: the first three 
patterns, for which the difference between BF2 and GF is huge, are the periodic ones. 

Table 2. Computations of significance coefficient (Sig) of some hexanucleotides according 
to several methods, in the 800 bp upstream region of ORF YGR022C of Saccharomyces 
cerevisiae chromosome VII. O(H): number of observed occurrences. BF1: 
Binomial formula with expected occurrences computed as in [?]. BF2: Binomial formula, 
Bernoulli probabilities. LD: Large deviations, without considering overlaps. LDo: 
Large deviations considering overlaps. GF: Generating function approach [?]. Column 
GF was computed using EXCEP software [?], based on formulas of [?]. 

Motif     O(H)   BF1     BF2     LD      LDo     GF
TGATGA    22     20      30.13   31.31   23.79   23.92
GATGAT    20     20      26.31   27.25   20.89   21.02
ATGATG    19     10.03   24.43   25.26   19.46   19.59
GATGAG    12     10.21   15.18   15.41   15.01   15.14
GGATGA    11     20      13.26   13.43   13.43   13.56
ATGAGG    11     20      13.26   13.43   13.43   13.56
TGAGGA    10     10.27   11.37   11.50   11.50   11.62
AGGATG    9      9.18    9.53    9.61    9.61    9.73
GAGGAT    9      9.05    9.53    9.61    9.61    9.73
TGAAGA    8      4.54    5.64    5.68    5.68    5.79
AAGATG    6      2.55    2.75    2.72    2.72    2.83
GAAGAT    6      2.39    2.75    2.72    2.72    2.83
GATGAA    6      2.35    2.75    2.72    2.72    2.83

Fig. 2. Graphical comparison of significance coefficients (Sig) of Table 2. Abscissae 
refer to the ordering of motifs in the table. Circles denote the binomial formulas for 
Bernoulli probabilities (BF2), diamonds the large deviations considering overlaps, and 
boxes the generating function approach (GF). 



6 Polyadenylation Signals in Human Genes 

In [?], Beaudoing et al. study polyadenylation signals in mRNAs of human genes. One 
of their aims is to find several variants of the well known AAUAAA signal. For this purpose, 
they select 5646 putative mRNA 3' ends of length 50 nucleotides and search for 
overrepresented hexamers. Pattern AAUAAA is clearly the most represented: it occurs 
in 3286 sequences, for a total number of 3456 occurrences. Seeking other (weaker) 
signals involves searching for other overrepresented hexanucleotides. Nevertheless, it is 
necessary to avoid artifacts, i.e. patterns that appear overrepresented because they are 
similar to the first pattern. The algorithm designed by Beaudoing et al. consists in discarding 
all sequences where the overrepresented hexamer has been found. Hence, they 
search for the most represented hexamer in the 2780 sequences which do not contain 
the strong signal AAUAAA. 

Here we show how Theorem 2 gives a procedure for dropping the artifacts of a 
given pattern without discarding the sequences where it appears. Table 3 presents the 15 
most represented hexamers in the sequences considered in [?]. Columns 2 and 3 respectively 
give the observed number of occurrences and the rank according to this criterion. 
Columns 4, 5 and 6 present the (non-conditioned) expected number of occurrences, the 
corresponding Z-score and the rank of the hexamer according to this Z-score. Here, the 
variance has been approximated by the expectation; this is possible as stated in [?]. 
Note that the rankings of columns 3 and 6 are quite similar: only patterns UAAAAA and 
UAAAUA do not belong to both rankings. A number of motifs look like the canonical 
one: they may be artifacts. This is confirmed by the last three columns which present, 
respectively, the expected number of occurrences conditioned by the observed number 
of occurrences of AAUAAA, the corresponding conditioned Z-score and the rank according 
to this criterion. It is clear that artifacts are dropped out, generally very far away 
in the ranking. It is worth noticing that some patterns which seemed overrepresented 
are actually avoided: this is the case for AUAAAU, which goes down from 5th to last 
place (among the 4096 possible hexamers, only 4078 are present in the sequences). As 
AUAAAU is an artifact of the strong signal, this means that U is rather avoided right after 
this signal. 

Table 3. Table of the most frequent hexanucleotides. Obs: number of observed occurrences. 
Rk: Rank. Exp.: (non-conditional) expectation. Cd.Exp.: Expectation conditioned 
by number of occurrences of AAUAAA. 

Hexamer   Obs.   Rk   Exp.     Z-sc.    Rk   Cd.Exp.   Cd.Z-sc.   Rk
AAUAAA    3456   1    363.16   167.03   1                         1
AAAUAA    1721   2    363.16   71.25    2    1678.53   1.04       1300
AUAAAA    1530   3    363.16   61.23    3    1311.03   6.05       404
UUUUUU    1105   4    416.36   33.75    8    373.30    37.81      2
AUAAAU    1043   5    373.23   34.67    6    1529.15   -12.43     4078
AAAAUA    1019   6    363.16   34.41    7    848.76    5.84       420
UAAAAU    1017   7    373.23   33.32    9    780.18    8.48       211
AUUAAA    1013   8    373.23   33.12    10   385.85    31.93      3
AUAAAG    972    9    184.27   58.03    4    593.90    15.51      34
UAAUAA    922    10   373.23   28.41    13   1233.24   -8.86      4034
UAAAAA    922    11   363.16   29.32    12   922.67    9.79       155
UUAAAA    863    12   373.23   25.35    15   374.81    25.21      4
CAAUAA    847    13   185.59   48.55    5    613.24    9.44       167
AAAAAA    841    14   353.37   25.94    14   496.38    15.47      36
UAAAUA    805    15   373.23   22.35    21   1143.73   -10.02     4068
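
The effect of conditioning on the Z-score can be illustrated with a few rows of Table 3. The sketch below assumes the Z-score is (observed - expected) divided by the square root of the variance, with the variance approximated by the expectation as stated above; the function name is hypothetical and the numbers are copied from the table.

```python
from math import sqrt


def z_score(observed: float, expected: float, variance: float = None) -> float:
    """Z-score; the variance defaults to the expectation (Poisson approximation)."""
    if variance is None:
        variance = expected
    return (observed - expected) / sqrt(variance)


# hexamer: (observed, unconditioned expectation, conditioned expectation)
rows = {
    "AAAUAA": (1721, 363.16, 1678.53),
    "AUAAAU": (1043, 373.23, 1529.15),
    "AUUAAA": (1013, 373.23, 385.85),
}

if __name__ == "__main__":
    for hexamer, (obs, expected, cond_expected) in rows.items():
        print(hexamer,
              round(z_score(obs, expected), 2),        # unconditioned Z-score
              round(z_score(obs, cond_expected), 2))   # conditioned Z-score
```

For AAAUAA and AUAAAU the conditioned expectation absorbs almost all of the apparent excess, whereas for AUUAAA the two expectations are close and the Z-score barely moves.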

The case of UUUUUU in rank 2 is particular: this pattern is effectively overrepresented, 
but was not considered by Beaudoing et al. as a putative polyadenylation signal 
because its position does not match the observed requirements (around -15/-16 nucleotides 
upstream of the putative polyadenylation site). It should also be stated that the 
approximation of the variance by the expectation that we do here for all patterns is not 
as good for periodic patterns like UUUUUU as for others [?]. As a result, the variance of 
UUUUUU is underestimated, so its actual Z-score is significantly lower than the one 
given in the table. 

Now the over-representation of AUUAAA (rank 3) is obvious; this is the known first 
variant of the canonical pattern. We remark that the following hexamer, UUAAAA, is an 
artifact of AUUAAA. This suggests defining a conditional expectation, or, even better, a 
p-value that takes into account the over-representation of two or more signals instead 
of one: in this example, AAUAAA and AUUAAA. This extension of Theorem 2 is the 
subject of future work. 

As mentioned above, the Z-score is not precise enough, and this remark also 
holds for conditioned Z-scores. In a second step, the authors of [?] computed a p-value 
defined by formula (4). This formula is approximated by the incomplete β-function. 
Nevertheless, any computation is rather delicate, and machine dependent due to numerous 
calls to exp and log functions. Numerical stability necessitates a very careful 
use of real precision. It is worth noticing that the large deviation principle applies for a 
Bernoulli process, with explicit values for the rate function and σ_a [?]. 



1 Conclusion and Perspectives 

In this paper, we illustrated a possible use of large deviation methods in computational
biology. These results allow, in some cases, a very fast and numerically stable computation
of p-values. These preliminary results are quite appealing and should be extended in several
directions. First, it may be necessary to eliminate several strong independent signals [?].
A second task is the simplification of our formulae for artifacts: this would make it
possible to choose automatically between a word and its subwords. A third task is the
extension to the computation of the p-value for the conditioned case. Finally, regulatory
sites may also be associated with structured motifs or spurious motifs [?], and the extension
to this case remains to be carried out.

Abstract. We describe algorithms for pattern-matching and pattern-learning in
TOPS diagrams (formal descriptions of protein topologies). These problems can 
be reduced to checking for subgraph isomorphism and finding maximal common 
subgraphs in a restricted class of ordered graphs. We have developed a subgraph 
isomorphism algorithm for ordered graphs, which performs well on the given 
set of data. The maximal common subgraph problem then is solved by repeated 
subgraph extension and checking for isomorphisms. Despite its apparent ineffi- 
ciency, this approach yields an algorithm with time complexity proportional to 
the number of graphs in the input set and is still practical on the given set of data. 
As a result we obtain fast methods that can be used for building a database of 
protein topological motifs and for the comparison of a given protein of known 
secondary structure against a motif database. 



1 Biological Motivation 

Once the structure of a protein has been determined, the next task for the biologist is to 
find hypotheses about its function. One possible approach is a pairwise comparison of 
its structure with the structures of proteins whose functions are already known. There 
are several tools that allow such comparisons, for example DALI [7] or CATH [11]. 
However there are two weaknesses with these approaches. Firstly, as the number of 
proteins with a given structure is growing, the time needed to do such comparisons is 
also growing. Currently there are about 15,000 protein structure descriptions deposited 
in the Protein Data Bank [1], but in the future this number may grow significantly. 
Secondly, even if a similarity with one or more proteins has been found, it may not be 
apparent whether this may also imply functional similarity, especially if the similarity 
is not very strong. 

Another possibility is to take, at the structural level, an approach similar to that used
for sequences in the PROSITE database [6]: precompute a database of motifs for proteins with
known structures, i.e., structural patterns that are associated with some particular protein
function. This effectively requires computing the maximal common substructure for a set of
structures. One such approach is that of CORA [10], based on multiple structural alignments
of protein sequences for given CATH families.

Both approaches have been successfully used for protein comparison at the se- 
quence level. The main difficulty in adapting them to the structural level is the complex- 
ity of the necessary algorithms — while exact sequence comparison algorithms work in 
linear time, exact structure comparison algorithms may require exponential time and 
the situation only gets worse with algorithms for finding maximal common substruc- 
tures. Another aspect of the problem is that it is far from clear which is the best way to 
define structure similarity. There are many possible approaches, which require different 
algorithmic methods and are likely to produce different results. 

Our work is aimed at the development of efficient comparison and maximal com- 
mon substructure algorithms using TOPS diagrams for structural topology descriptions, 
at the definition of structure similarity in a natural way that arises from such formali- 
sation, and at the evaluation of usefulness of such an approach. The drawback of our 
approach is that TOPS diagrams are not very rich in information; however it has the ad- 
vantage that it is still possible to design practical algorithms for this level of abstraction. 



2 TOPS Diagrams 

At a comparatively simple level, protein structures can be described using TOPS cartoons
(see [4, 13, 14]). A sample cartoon for 2bopA0 is shown in Figure 1(a); for comparison a
Rasmol-style picture is given in Figure 1(b). The cartoon shows the secondary structure
elements (SSEs), namely β-strands (depicted by triangles) and α-helices (depicted by
circles), how they are connected in a sequence from amino to carboxyl terminus, and their
relative spatial positions and orientations. Such representations have been used by
biologists for some time. However, the graphical images do not explicitly represent all
topological information implied by such descriptions, and there are no strict rules
governing the appearance of a TOPS cartoon for a given protein.

Fig. 1. TOPS Cartoon and Rasmol Picture of 2bopA0: (a) TOPS cartoon, (b) Rasmol picture.

TOPS diagrams, developed by Gilbert et al. [5], are a more formal description of protein
structural topology and are based on TOPS cartoons. Instead of representing spatial positions
by element positions in a plane, a TOPS diagram contains information about the grouping of
β-strands in β-sheets (two adjacent elements in a β-sheet are connected by an H-bond, which
can be either parallel or anti-parallel) and some information about the relative orientation
of elements (any two SSEs can be connected by either left or right chirality). Note that, in
the topological sense, we reduce the set of atomic hydrogen bonds between a pair of strands
to a single H-bond relationship between the strands. In principle chiralities can be defined
between any two SSEs; however, only a subset of the most important chiralities is included in
TOPS diagrams, and this subset roughly corresponds to the implicit position information in
TOPS cartoons. A TOPS diagram can be regarded as a graph with four different types of
vertices (corresponding to up- or down-oriented strands and up- or down-oriented helices) and
four different types of edges (corresponding to parallel or antiparallel H-bonds and left or
right oriented chiralities). Moreover, the corresponding graph is ordered: each vertex is
assigned a unique number from 1 to n, where n is the total number of vertices. In Figure 2
the ordering is also indicated by placing the vertices in the order of increasing numbers
(looking from left to right).




Fig. 2. TOPS Diagram of 2bopA0. 



3 Pattern Matching and Pattern Discovery in TOPS 

If we describe protein secondary structure by TOPS diagrams, a natural way to characterise
the similarity of two proteins is by using patterns. In general, we can define patterns using
the same type of graphs as for TOPS diagrams. We say that a given pattern matches a given
TOPS diagram if and only if the corresponding pattern graph is a subgraph of the
corresponding TOPS diagram graph. Here we assume that the subgraph relation also preserves
the order of vertices, i.e., there is a mapping F of pattern graph vertices to target graph
vertices such that, for any pair of vertices v and w in the pattern graph:

- if the number of v is larger than the number of w, then the number of F(v) is also
larger than the number of F(w); and

- if there is an edge between v and w, then there is an edge (of the same type) between
F(v) and F(w).

Figure 3 shows one of the possible patterns that matches the diagram for 2bopA0, by mapping
the pattern vertices with numbers 1, 2, 3, 4, 5, and 6 to the diagram vertices with numbers
1, 2, 4, 6, 7, and 8. In practice, however, it might be useful to make the pattern
definition more complicated. There might be reasons to require that close vertices in the
pattern (i.e., vertices with close numbers) be mapped to close vertices in the target
diagram (for some natural notion of close). Alternatively, it might be useful to require
that the target graph does not contain extra edges between vertices to which pattern graph
vertices are mapped (in this case the pattern graph must be an induced subgraph of the
target graph).

Fig. 3. TOPS Pattern.

If we want to compare a target TOPS diagram to a set of diagrams, we can do this by pairwise
comparisons between the target and each member of the comparison set; each such comparison
can be made by finding a largest common pattern for the two diagrams and assigning a
similarity measure based on the size of the pattern and the sizes of the two diagrams.
Alternatively, if we want to use a motif-based approach, we can find the largest common
patterns for a given set of proteins, consider these patterns as motifs, and check whether a
pattern for some motif matches the diagram of a target protein. In practice the definition
of a motif may be more complicated: for example, it may include several patterns or some
additional information.

Several algorithms for protein comparison based on the notion of patterns have already been
developed and implemented by David Gilbert. The system is available at
http://www3.ebi.ac.uk/tops/; it permits searching for proteins that match a given pattern
and performing pattern-based comparisons of TOPS descriptions of proteins. Our current task
is to implement the more efficient algorithms that we describe here. These algorithms will
permit the fast generation of motif databases, which we plan to make available on the web.

4 Experimental Results 

4.1 Methodology and Databases 

In the experiments that we have performed to date we have tried to estimate the usefulness
of pattern-based protein motifs, i.e., the probability that a protein matching a given motif
also has some real similarity with the other proteins characterised by the same motif. To do
this, we have compared our approach against the existing CATH protein classification
database. CATH [11] is a
hierarchical classification of protein domain structures, which clusters proteins at four 
major levels — Class (C), Architecture (A), Topology (T) and Homologous superfam- 
ily (H). There are four different C classes — mainly alpha (class 1), mainly beta (class 
2), alpha-beta (class 3) and low secondary structure content (class 4). In most cases C 
classes are assigned automatically. The architecture level describes the overall shape of 
the domain structure according to orientations of the secondary structures; classes in 
this level are assigned manually. Classes in the topology level depend on both the over- 
all shape and connectivity of the secondary structures and are assigned automatically 
by the SSAP algorithm. Classes in the homologous superfamily level group together 
protein domains which are thought to share a common ancestor and can therefore be 
described as homologous. They are assigned automatically from the results of sequence 
comparisons and structure comparisons (using SSAP). 

Our comparisons are based on the assumption that identical CATH numbers will also imply
some similarity of the TOPS diagrams for the corresponding proteins. The TOPS Atlas database
[13], containing 2853 domains and based on clustering structures from the Protein Data Bank
[1] using the standard single linkage clustering algorithm at 95% sequence similarity, was
selected as the data set for this investigation. Structures with identical CATH numbers (to
a given level) have been placed in one group, and a maximal common pattern for this group
has been computed. Then the pattern was matched against all structures in the selected
subset, and the quality q of the pattern, corresponding to the positive predictive value,
was computed as follows:

q = (number of proteins in a given group) / (number of successful matches)

Thus, q = 1 corresponds to a good pattern (no false positives), and the value of q is lower
for less good patterns.
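
The quality measure can be written as a short function. The following is a minimal sketch
for illustration only: the name `pattern_quality` and the use of sets of protein identifiers
are assumptions and are not taken from the experiments described here.

```python
def pattern_quality(group, matched):
    """Return q = |group| / |matched| for a pattern learned from `group`.

    group   -- identifiers of the proteins that share one CATH number
    matched -- identifiers of all proteins in the data set matched by the pattern
    """
    group, matched = set(group), set(matched)
    if not group:
        raise ValueError("the group must contain at least one protein")
    # A common pattern of the group matches every member of the group,
    # so the group has to be contained in the set of matches.
    if not group <= matched:
        raise ValueError("a common pattern must match every protein in its group")
    return len(group) / len(matched)
```

A pattern that matches only the members of its own group gives q = 1; every additional
match outside the group (a false positive) lowers the value.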

4.2 Results 

The experiments were performed using the CATH number identity at levels A, T, and H. The
CATH number identity at the A level was clearly insufficient to guarantee any similarity at
the TOPS diagram level; somewhat more surprising was the fact that identity at the T
(topological) level still produced noticeably weaker results than identity at the H level.
Results for the latter are shown in Figure 4. Here the values of q for all domains from the
data set (in lexicographical order by CATH numbers) are shown. The first 527 structures
correspond to CATH class 1 (mainly α), the next 1048 to class 2 (mainly β), the following
1151 to class 3 (α-β) and the last 124 to class 4 (weak secondary structure content). As can
be expected, q values are small for class 4, since there is very little secondary structure
information, and also for class 1, since in mainly alpha domains there are few H-bonds and
the corresponding TOPS diagrams contain little information about topology. Better q values
can be observed for classes 2 and 3. Figure 5 shows q values (in light-grey) for class 3.
Here the proteins have been reordered according to increasing q values. As can be seen, in
about 36% of cases the q value is 1, i.e., the CATH number is uniquely defined by a TOPS
pattern. Also, there are not many proteins with q values close to, but less than, 1.
Therefore, if a pattern has been shown to be good for known proteins, it is likely that it
will remain good for new, as yet unclassified, proteins. For comparison the figure also
contains values (in dark-grey) where q values have been computed using only
secondary-structure sequence patterns instead of complete TOPS diagrams. This demonstrates
that good sequence patterns only exist for approximately 8% of structures. The superiority
of sequence patterns for one group is caused by different definitions of the largest
pattern.

Fig. 4. Quality of TOPS Patterns at CATH H Level.

Fig. 5. Quality of TOPS Patterns for CATH Class 3.

Figure 6 contains the same data as Figure 5, but ordered first by pattern size, measured as
the number of SSEs in the pattern, and then by q values. It can be seen that we start to get
good q values when the number of SSEs reaches 7 or 8 (proteins with numbers from 459 or 531
on the horizontal axis), and that q values are good in most cases when the number of SSEs
reaches 11 (proteins with numbers from 800 on the horizontal axis). Therefore, if a protein
contains 7 or more SSEs, there is a good chance that it will have a good pattern and, if it
contains 11 or more SSEs, then in most cases it will have a good pattern.

Juris Viksna and David Gilbert

Fig. 6. Quality of TOPS Patterns for CATH Class 3 Ordered by the Size of Patterns.

Thus, the results obtained so far suggest that a database of pattern motifs could 
be quite useful for comparison of those proteins that have sufficiently rich secondary- 
structure content and especially for proteins with a large number of strands. This is not 
the largest subgroup of all proteins; however for this subgroup there are good chances 
that comparison with TOPS motifs will give biologically useful information. Of course, 
TOPS diagrams contain limited information about secondary structure; thus we can 
expect that motifs based on richer secondary structure models may give better results. 
At the same time the TOPS formalism has the advantage that all computations can be 
performed comparatively quickly. The exact computation times are very dependent on 
the given data, but in general we have observed that the comparison of a given protein 
against a database of about 1000 motifs requires less than 0.1 second on an ordinary 
600 MHz PC workstation. The discovery of motifs and associated evaluation via pattern 
matching over the TOPS Atlas has been done in about 2 hours on the same equipment. 

5 TOPS Patterns and Related Graph Problems 

The basic problems that arise in the TOPS formalism, namely pattern matching and pattern
discovery, can be easily reduced to the subgraph isomorphism and maximal common subgraph
problems in ordered graphs. We consider the following types of vertex-ordered labelled
graphs.

5.1 Definitions

A given graph $G = (V, E)$ is vertex-ordered if there is a one-to-one mapping between the set
of numbers $\{1, 2, \ldots, |V|\}$ and the set of vertices $V$. Let us call the number that
corresponds to $v \in V$ the vertex position and denote it by $p(v)$. We consider undirected
graphs, thus we can assume that edges are defined by ordered pairs $(v, w)$ with
$p(v) < p(w)$. Given a vertex-ordered graph, we define the edge order in the following way:

$$p((v_1, w_1)) < p((v_2, w_2)) \iff p(v_1) < p(v_2) \lor \bigl(p(v_1) = p(v_2) \land p(w_1) < p(w_2)\bigr)$$

and assign to edges numbers $\{1, 2, \ldots, |E|\}$ according to this order. We call these
numbers edge positions and denote them by $p(e)$.
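
The edge order is the lexicographic order on pairs of vertex positions, so edge positions
can be obtained by sorting. The sketch below assumes that each edge is given directly as a
pair of vertex positions; the function name `edge_positions` is hypothetical.

```python
def edge_positions(edges):
    """Map each edge (p(v), p(w)), with p(v) < p(w), to its edge position p(e)."""
    for pv, pw in edges:
        if not pv < pw:
            raise ValueError("edges must be written as (p(v), p(w)) with p(v) < p(w)")
    # Tuples compare lexicographically: first by p(v), then by p(w),
    # which is exactly the edge order defined above.
    ordered = sorted(set(edges))
    return {edge: position for position, edge in enumerate(ordered, start=1)}
```

For instance, the edges (2, 5), (1, 4) and (1, 3) receive positions 3, 2 and 1 respectively.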

A graph $G = (V, E)$ is vertex- (edge-) labelled with a set of labels $S$ if there is given a
function $l_v : V \to S$ (and, for edge labels, $l_e : E \to S$). We denote the label of a
vertex $v \in V$ by $l(v)$ and the label of an edge $e \in E$ by $l(e)$. For given
vertex-ordered and vertex- and edge-labelled graphs $G_1 = (V_1, E_1)$ and
$G_2 = (V_2, E_2)$, we say that $G_1$ is isomorphic to a subgraph of $G_2$ if there is an
injective mapping $I$ from $V_1$ to $V_2$ such that:

- $\forall v, w \in V_1,\ p(v) < p(w) \Rightarrow p(I(v)) < p(I(w))$

- $\forall v, w \in V_1,\ (v, w) \in E_1 \Rightarrow (I(v), I(w)) \in E_2$

- $\forall v \in V_1,\ l(v) = l(I(v))$

- $\forall (v, w) \in E_1,\ l((v, w)) = l((I(v), I(w)))$

Since each edge is uniquely determined by two vertices, we can extend the isomorphism $I$ to
edges by defining $I((v, w)) = (I(v), I(w))$. Then $I$ preserves edge order just like it
preserves vertex order, i.e., $\forall e_1, e_2 \in E_1,\ p(e_1) < p(e_2) \Rightarrow
p(I(e_1)) < p(I(e_2))$.
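
The four conditions above can be checked directly by brute force, which is useful as a
reference when testing faster algorithms. The sketch below is not the algorithm of this
paper: it simply enumerates all order-preserving injective mappings and takes exponential
time. The graph encoding (a list of vertex labels in vertex order plus a dictionary of
labelled edges with 0-based indices) and the name `is_ordered_subgraph` are assumptions made
for illustration.

```python
from itertools import combinations


def is_ordered_subgraph(p_labels, p_edges, t_labels, t_edges):
    """Brute-force test whether the pattern is isomorphic to a subgraph of the target.

    p_labels, t_labels -- vertex labels listed in vertex order
    p_edges, t_edges   -- dict mapping (i, j), i < j (0-based), to an edge label
    Returns the images of pattern vertices 0, 1, ... as a tuple, or None.
    """
    n, m = len(p_labels), len(t_labels)
    # An order-preserving injective mapping is exactly a choice of n target
    # vertices in increasing order, so combinations() enumerates all candidates.
    for image in combinations(range(m), n):
        # Vertex labels must be preserved.
        if any(p_labels[i] != t_labels[image[i]] for i in range(n)):
            continue
        # Every pattern edge must map to a target edge with the same label.
        if all(t_edges.get((image[i], image[j])) == label
               for (i, j), label in p_edges.items()):
            return image
    return None
```

Because `combinations` yields index tuples in increasing order, the first condition (vertex
order) holds by construction and only the remaining three have to be tested.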

5.2 Graph Representation of TOPS Diagrams 

We can consider a TOPS diagram as a vertex-ordered and vertex- and edge-labelled graph with
the set of vertex labels $S_v = \{e+, e-, h+, h-\}$ (up- or down-oriented strand or up- or
down-oriented helix) and the set of edge labels $S_e = \{P, A, L, R, PL, PR, AL, AR\}$
(parallel or antiparallel H-bonds or left- or right-oriented chiralities or a combination of
H-bonds and chiralities). In practice, P edges are only permitted between e+ and e+ or e-
and e- vertices, and A edges are allowed only between e+ and e- or e- and e+ vertices, but
this is not essential here.

For practical purposes it is also worth noting the complexity of the graphs that have to be
dealt with in the TOPS formalism: the maximal number of vertices is around 50, and the number
of edges is comparatively small and similar to the number of vertices.

Let $P$ be a TOPS pattern, $D_1$ and $D_2$ be TOPS diagrams, and $G(P)$, $G(D_1)$ and
$G(D_2)$ be the graphs corresponding to these patterns or diagrams. Then the problem of
checking whether the TOPS pattern $P$ matches diagram $D_1$ is equivalent to checking whether
$G(P)$ is isomorphic to a subgraph of $G(D_1)$. Similarly, the problem of finding a largest
common pattern $P$ of $D_1$ and $D_2$ is equivalent to finding a largest common subgraph
$G(P)$ of $G(D_1)$ and $G(D_2)$.
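
One possible in-memory encoding of such a graph is sketched below. The class name
`TopsGraph`, its fields, and the choice of 0-based indices are assumptions made for
illustration, and applying the strand-orientation rule to the combined labels PL, PR, AL
and AR as well is also an assumption; the text states the rule only for P and A edges.

```python
from dataclasses import dataclass, field

VERTEX_LABELS = {"e+", "e-", "h+", "h-"}
EDGE_LABELS = {"P", "A", "L", "R", "PL", "PR", "AL", "AR"}


@dataclass
class TopsGraph:
    labels: list                               # labels[i]: label of the vertex at position i + 1
    edges: dict = field(default_factory=dict)  # (i, j) with i < j -> edge label

    def __post_init__(self):
        unknown = set(self.labels) - VERTEX_LABELS
        if unknown:
            raise ValueError(f"unknown vertex labels: {sorted(unknown)}")

    def add_edge(self, i, j, label):
        if not 0 <= i < j < len(self.labels):
            raise ValueError("need 0 <= i < j < number of vertices")
        if label not in EDGE_LABELS:
            raise ValueError(f"unknown edge label: {label}")
        a, b = self.labels[i], self.labels[j]
        if label[0] in "PA":
            # H-bonds join strands only; P needs equal orientations, A opposite ones.
            if not (a.startswith("e") and b.startswith("e")):
                raise ValueError("H-bond edges must join two strands")
            if (a == b) != (label[0] == "P"):
                raise ValueError("strand orientations do not fit the H-bond type")
        self.edges[(i, j)] = label
```

Storing edges under the key (i, j) with i < j mirrors the convention that an undirected edge
is written as an ordered pair of vertices with increasing positions.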

5.3 Complexity and Relation to Other Work

First, it is easy to see that the subgraph isomorphism problem for vertex-ordered graphs
remains NP-complete, since the maximal clique problem is NP-complete, and this is not altered
by vertex ordering. Also, the relatively small number of edges cannot be exploited to obtain
polynomial algorithms, since in [3] and [15] similar graph structures are considered that are
even simpler (the vertex degree is 0 or 1), and for such graphs the subgraph isomorphism
problem is proved to be NP-complete. In [3] an algorithm is given that is polynomial with
respect to the number of overlapping edges; however, in TOPS this number tends to be quite
large.

There are several good non-polynomial algorithms for subgraph isomorphism, the two most
popular being those of Ullmann [12] and McGregor [9]. Although these are not easily adaptable
to vertex-ordered graphs, the vertex ordering seems to be a property that could considerably
improve the efficiency of such algorithms. Our algorithm can be regarded as
a variant of a method based on constraint satisfaction [9]; however there is an additional 
mechanism for periodically recomputing constraints. A very similar class of graphs has 
also been considered by I. Koch, T. Lengauer and E. Wanke in [8] where the authors de- 
scribe a maximal common subgraph algorithm based on searching for maximal cliques 
in a vertex product graph. This method seems to be applicable also for TOPS; however 
it is only practical for finding maximal common subgraphs for two graphs and is not 
directly useful for finding motifs for larger sets of proteins. 



6 Subgraph Isomorphism Algorithm for Ordered Graphs 

We have developed a subgraph isomorphism algorithm that exploits the fact that the graphs
are vertex-ordered. Initially, let us assume that we are dealing with graphs that are
connected and do not contain isolated vertices (this set is also the most important in
practice). Then an isomorphism mapping I is uniquely determined by defining the mapping of
edges.

The algorithm tries to match edges in increasing order of edge positions and backtracks if
no match can be found for some edge. Since the graphs are ordered, the positions in the
target graph to which a given edge may be mapped, and which have to be checked, can only
increase. Two additional ideas are used to make this process more efficient. Firstly, we
assign a number of additional labels to vertices and edges. Secondly, if an edge e cannot be
mapped according to the existing mapping for previous edges, then the next place where this
edge can be mapped according to the labels is found, and the minimal match positions of
previous edges are advanced in order to be compatible with the minimal position of e.

6.1 Labelling 

By definition vertices and edges are already assigned labels $l_v$ and $l_e$, respectively,
that must be preserved by an isomorphism mapping. Additionally we use another kind of label
for both vertices and edges, which we call Index. For a vertex $v$, Index($v$) is a 16-tuple
of integers (containing twice as many elements as there are edge labels). With $k$ denoting
the number of edge labels, the $i$th element of Index($v$) is the number of edges $(x, v)$
with $l_e((x, v))$ equal to the $i$th possible value of $l_e$ (according to some initially
fixed order of labels). Similarly, the $(k + i)$th element of Index($v$) is the number of
edges $(v, x)$ with $l_e((v, x))$ equal to the $i$th possible value of $l_e$. Thus, the value
Index($v$) encodes the numbers of incoming and outgoing edges of all possible types for a
given vertex $v$. For an edge $e = (v, w)$, Index($e$) is a 4-tuple of integers
$(S_1, S_2, E_1, E_2)$, where $S_1$ is the number of edges $(v, x)$ with $p(x) < p(w)$, $S_2$
is the number of edges $(v, x)$ with $p(x) > p(w)$, $E_1$ is the number of edges $(y, w)$
with $p(y) < p(w)$, and $E_2$ is the number of edges $(y, w)$ with $p(y) > p(w)$. The edge
index describes how many shorter or longer other edges are connected to the endpoints of a
given edge. For both vertices and edges we define Index($x$) $\le$ Index($y$) if the
inequality holds between all corresponding pairs of elements of the 16-tuples (or 4-tuples).
It is easy to see that for any vertex or edge $x$ we must have Index($x$) $\le$
Index($I(x)$).
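
The Index labels can be computed in one pass over the edges. The sketch below is
illustrative: the function names and the 0-based graph encoding are assumptions, and the
fixed order of edge labels is chosen arbitrarily. One further assumption should be noted:
as printed, the conditions for E1 and E2 compare p(y) with p(w), which would make E2 always
zero for edges written as ordered pairs, so the code compares p(y) with p(v) instead, in
line with the remark about shorter and longer edges.

```python
EDGE_LABELS = ("P", "A", "L", "R", "PL", "PR", "AL", "AR")  # some fixed order of labels


def vertex_index(v, edges):
    """16-tuple: per label, edges (x, v) ending at v, then edges (v, x) starting at v."""
    k = len(EDGE_LABELS)
    index = [0] * (2 * k)
    for (a, b), label in edges.items():
        i = EDGE_LABELS.index(label)
        if b == v:
            index[i] += 1      # edge (x, v)
        if a == v:
            index[k + i] += 1  # edge (v, x)
    return tuple(index)


def edge_index(edge, edges):
    """4-tuple (S1, S2, E1, E2) for the edge (v, w); E1 and E2 use the assumed reading."""
    v, w = edge
    s1 = sum(1 for (a, b) in edges if a == v and b < w)  # shorter edges leaving v
    s2 = sum(1 for (a, b) in edges if a == v and b > w)  # longer edges leaving v
    e1 = sum(1 for (a, b) in edges if b == w and a < v)  # longer edges arriving at w
    e2 = sum(1 for (a, b) in edges if b == w and a > v)  # shorter edges arriving at w
    return (s1, s2, e1, e2)


def index_leq(x, y):
    """Component-wise comparison used to filter candidate matches."""
    return all(a <= b for a, b in zip(x, y))
```

A target vertex or edge t is a candidate image of a pattern vertex or edge x only if
`index_leq(Index(x), Index(t))` holds, since an order-preserving embedding cannot reduce
any of these counts.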

6.2 Algorithm 

We assume that graphs are given as arrays PV, PE, TV and TE, where PV is an array of
vertices in the pattern graph with PV[i] being the vertex v with p(v) = i, PE is an array of
edges in the pattern graph with PE[i] being the edge e with p(e) = i, and TV and TE are
similar arrays for the target graph. For an edge e of the pattern graph, the list Matches(e)
contains all possible positions (in increasing order) in the target graph to which e can be
matched according to vertex and edge labels and Index values. By Matches(e)[i] we denote the
ith element of this list. The number Next(e) is the first position in the Matches(e) list at
which it may still be worth trying to match the edge. Initially for all edges we have
Next(e) = 1. For a vertex v, the number Pos(v) is the position in the target graph to which
v is matched; if v is not matched, we have Pos(v) = 0.

Algorithm 1 shows the main loop. Starting from the first edge, the algorithm tries to find
matches for all edges in increasing order and, if it succeeds, returns an array Pos of
vertex mappings. If for some edge a match consistent with the matches for previous edges
cannot be found, the procedure AdvanceEdgeMatchPositions is invoked, which tries to increase
the values Next(e) for some of the already matched edges, and the matching process is
continued starting from the first edge for which the value Next(e) has been changed.

Procedure AdvanceEdgeMatchPositions uses a variant of depth-first search to find 
edges for which Next(e) can be increased. Alternative strategies are of course possible. 



6.3 Correctness 

The informal motivation why the algorithm correctly finds an isomorphic subgraph (or gives
the answer that no isomorphic subgraph exists) is the following. First, as already noted
above, for connected ordered graphs the isomorphism mapping is completely defined by
defining the mapping for edges. For an isomorphism mapping it is sufficient to satisfy the
labelling constraints on edge endpoints and to preserve edge order and connectivity. If the
AdvanceEdgeMatchPositions procedure is not used, the algorithm simply performs an exhaustive
search of all mappings satisfying these constraints and either finds one, or returns an
answer that no such mapping exists. If the AdvanceEdgeMatchPositions procedure is included,
then when invoked it receives a vertex v in the pattern graph and the first vertex vt in the
target graph to which v may be mapped according to the search performed so far. The
constraints on edge mappings are then narrowed down to be consistent with the mapping
requirement for vertex v.

procedure SubgraphIsomorphismInOrderedGraphs(PV, PE, TV, TE);
begin
    foreach edge e in PE do
        Compute the list Matches(e);
        Next(e) ← 1;
        if Matches(e) = ∅ then return Not Isomorphic;
    end
    foreach vertex v in PV do
        Pos(v) ← 0;
    end
    k ← 1;
    while k ≤ |PE| do
        edge = (v, w) ← PE[k];
        if Next(edge) > |Matches(edge)| then return Not Isomorphic;
        else
            Find the smallest i ≥ Next(edge) such that, for the target graph edge
            (vt, wt) = Matches(edge)[i] and for both vertices v and w, either Pos(v) = 0
            or Pos(v) = vt, and either Pos(w) = 0 or Pos(w) = wt (and, if k > 1,
            p((vt, wt)) > Matches(PE[k - 1])[Next(PE[k - 1])]);
            if such an i is found then
                Next(edge) ← i;
                Pos(v) ← vt; Pos(w) ← wt;
                k ← k + 1;
            end
            else
                Find the smallest j ≥ Next(edge) such that (if k > 1) for the target
                graph edge (vt, wt) = Matches(edge)[j] we have
                p((vt, wt)) > Matches(PE[k - 1])[Next(PE[k - 1])]
                (take j ← Next(edge), if k = 1);
                Next(edge) ← j;
                for all edges e in PE with p(e) < k do
                    Moved(e) ← false;
                end
                (vt, wt) ← Matches(edge)[j];
                AdvanceEdgeMatchPositions(v, vt);
                Set k to be the smallest value for which there is an edge e = (v2, w2)
                with p(e) = k and either Pos(v2) = 0 or Pos(w2) = 0;
            end
        end
    end
    return Pos (array of vertex mappings);
end

Algorithm 1: Main Loop.

procedure AdvanceEdgeMatchPositions(v, vt);
begin
    PatternVertexStack ← ∅; push(PatternVertexStack, v);
    TargetVertexStack ← ∅; push(TargetVertexStack, vt);
    while PatternVertexStack ≠ ∅ do
        pvert ← pop(PatternVertexStack); tvert ← pop(TargetVertexStack);
        foreach edge e with p(e) < k, Moved(e) = false, and with endpoint pvert do
            Moved(e) ← true;
            Find the smallest i ≥ Next(e) such that, for (vt2, wt2) = Matches(e)[i], we
            have wt2 ≥ tvert (or vt2 ≥ tvert, if pvert is the rightmost endpoint of e);
            if such an i is found then
                Next(e) ← i;
                Let newpvert be the other endpoint of e;
                newtvert ← vt2 (or newtvert ← wt2, if pvert is the rightmost endpoint
                of e);
                if Pos(newpvert) ≠ 0 then
                    Pos(newpvert) ← 0;
                    push(PatternVertexStack, newpvert);
                    push(TargetVertexStack, newtvert);
                end
            end
            else
                return Not Isomorphic
            end
        end
    end
end

Algorithm 2: The Depth-First Search for Edges.
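
The exhaustive variant described above, without the AdvanceEdgeMatchPositions procedure,
can be sketched as a recursive backtracking search over edges in edge order. This is a
simplified illustration and not the implementation evaluated in this paper: the Matches
lists are filtered by edge and endpoint labels only (no Index values), vertex order is
checked explicitly on every partial mapping, and the function name and 0-based graph
encoding are assumptions. The pattern is assumed to contain no isolated vertices.

```python
def match_edges_in_order(p_labels, p_edges, t_labels, t_edges):
    """Return a dict {pattern vertex: target vertex}, or None if no match exists.

    p_labels, t_labels -- vertex labels listed in vertex order
    p_edges, t_edges   -- dict mapping (i, j), i < j (0-based), to an edge label
    """
    pe = sorted(p_edges)  # pattern edges in increasing edge position
    te = sorted(t_edges)  # target edges in increasing edge position
    # Matches(e): target edges compatible with e by edge label and endpoint labels.
    matches = [
        [(a, b) for (a, b) in te
         if t_edges[(a, b)] == p_edges[(v, w)]
         and t_labels[a] == p_labels[v]
         and t_labels[b] == p_labels[w]]
        for (v, w) in pe
    ]

    def order_preserved(pos):
        # Images must be strictly increasing, which also makes the mapping injective.
        images = [pos[v] for v in sorted(pos)]
        return all(x < y for x, y in zip(images, images[1:]))

    def extend(k, pos, last):
        if k == len(pe):
            return pos
        v, w = pe[k]
        for a, b in matches[k]:
            if last is not None and (a, b) <= last:
                continue  # edge order must be preserved
            if pos.get(v, a) != a or pos.get(w, b) != b:
                continue  # an endpoint is already mapped to another vertex
            new_pos = {**pos, v: a, w: b}
            if not order_preserved(new_pos):
                continue
            found = extend(k + 1, new_pos, (a, b))
            if found is not None:
                return found
        return None

    return extend(0, {}, None)
```

Skipping candidates that do not lie strictly after the image of the previous edge is what
the ordering buys: the search over target positions only ever moves forward for a given
partial mapping.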



6.4 General Case of Disconnected Graphs 

To deal with graphs that may be disconnected (but do not have isolated vertices) we
additionally have to check that the vertex positions are preserved by the isomorphism
mapping, i.e., for vertices v and w in the pattern graph with p(v) < p(w) we must have
p(I(v)) < p(I(w)). If we have isolated vertices, we additionally have to check that the
sequence of vertices between v and w is a substring of the sequence of vertices between
I(v) and I(w). This additional checking can be easily incorporated into the algorithm.

7 Maximal Common Subgraph Problem

The subgraph isomorphism algorithm is very fast for graphs corresponding to TOPS 
diagrams. This permits finding maximal common subgraphs by repeated extension and 
checking for subgraph isomorphism. 

In order to find the maximal common subgraph for a given set of graphs we basically use an
exhaustive search. Starting with the simplest (one-vertex) pattern graph, we check for
subgraph isomorphism against all graphs in the given set and, in the case of success,
attempt to extend the already matched pattern graph in all possible ways. Some restrictions
on the number of different types of edges and vertices can be deduced from the given set of
target graphs and are used by the algorithm. Apart from that, the previous successful match
may be used to deduce information about extensions that are more likely to be successful in
the next match. In general this does not prune the search space but may help to discover
large common subgraphs earlier. There is also a greater probability that the largest common
subgraph is found within a given time limit, even when the search has not been completed.

The advantage of this approach is that we obtain an algorithm with time complexity that is
linear with respect to the number of graphs in a given set. Since there are likely to be
more restrictions on the pattern for larger sets, often the most difficult cases arise for
sets containing only one graph; however, in this case we can simply return the given graph
as the maximal common subgraph. Other methods that are known (for example as described in
[8]) may be more efficient for sets containing a small number (basically just two) of
graphs, but in general cannot be used to find the exact answer to the problem for larger
sets.
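
The search by repeated extension can be sketched as follows. This is a deliberately naive
illustration under several assumptions: a brute-force matcher stands in for the subgraph
isomorphism algorithm of Section 6, pattern size is taken to be the number of vertices plus
the number of edges, none of the ordering heuristics mentioned above are used, and all names
and the 0-based encoding are hypothetical. It is only practical for very small graphs.

```python
from itertools import combinations


def matches(pattern, graph):
    """Brute-force ordered subgraph test; pattern edges are ((i, j), label) pairs."""
    (p_labels, p_edges), (g_labels, g_edges) = pattern, graph
    for image in combinations(range(len(g_labels)), len(p_labels)):
        if all(p_labels[i] == g_labels[image[i]] for i in range(len(p_labels))) and \
           all(g_edges.get((image[i], image[j])) == label for (i, j), label in p_edges):
            return True
    return False


def extensions(pattern, vertex_labels, edge_labels):
    """Yield every pattern obtained by adding one vertex or one edge."""
    labels, edges = pattern
    n = len(labels)
    # Insert a new vertex at any position; later vertices shift right by one.
    for pos in range(n + 1):
        moved = frozenset(((i + (i >= pos), j + (j >= pos)), label)
                          for (i, j), label in edges)
        for label in vertex_labels:
            yield labels[:pos] + (label,) + labels[pos:], moved
    # Add an edge between two vertices that are not yet joined.
    joined = {pair for pair, _ in edges}
    for i, j in combinations(range(n), 2):
        if (i, j) not in joined:
            for label in edge_labels:
                yield labels, edges | {((i, j), label)}


def maximal_common_pattern(graphs):
    """graphs: non-empty list of (tuple of vertex labels, dict (i, j) -> edge label)."""
    # Only labels that occur in every graph can appear in a common pattern.
    vertex_labels = set.intersection(*(set(gl) for gl, _ in graphs))
    edge_labels = set.intersection(*(set(ge.values()) for _, ge in graphs))
    size = lambda pat: len(pat[0]) + len(pat[1])
    best, seen = None, set()
    stack = [((label,), frozenset()) for label in sorted(vertex_labels)]
    while stack:
        pattern = stack.pop()
        if pattern in seen:
            continue
        seen.add(pattern)
        # One isomorphism test per graph: the cost per candidate is linear in len(graphs).
        if not all(matches(pattern, g) for g in graphs):
            continue
        if best is None or size(pattern) > size(best):
            best = pattern
        stack.extend(extensions(pattern, vertex_labels, edge_labels))
    return best
```

Only patterns that match every graph are extended further, so the search is bounded by the
number of common patterns, and each candidate is tested once against each graph in the set.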

Experiments suggest that this approach is still practical for TOPS diagrams. As mentioned
above in the results section, all motifs in the Atlas for the CATH level H have been found
by using repeated pattern matching and extension in 2 hours on an ordinary PC workstation.
However, it seems that the size of TOPS diagrams is quite close to the limit up to which such
a maximal common subgraph algorithm can be successfully used, thus our solution may be quite
problem specific. At the same time we expect that the subgraph isomorphism algorithm may
also be adapted for considerably larger structures and may be useful for other problems in
bioinformatics.
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Abstract. We develop fast algorithms for computing the linking number of a 
simplicial complex within a filtration. We give experimental results in applying 
our work toward the detection of non-trivial tangling in biomolecules, modeled 
as alpha complexes. 



1 Introduction 



In this paper, we develop fast algorithms for computing the linking numbers of
simplicial complexes. Our work is within a framework of applying computational topology
methods to the fields of biology and chemistry. Our goal is to develop useful tools for
researchers in computational structural biology.



Motivation and Approach. In the 1980s, it was shown that DNA, the molecular
structure of the genetic code of all living organisms, can become knotted during
replication. This finding initiated interest in knot theory among biologists and chemists
for the detection, synthesis, and analysis of knotted molecules. The impetus for
this research is that molecules with non-trivial topological attributes often display
exotic chemistry. Taylor recently discovered a figure-of-eight knot in the structure of a
plant protein by examining 3,440 proteins using a computer program. Moreover,
chemical self-assembly units have been used to create catenanes, chains of interlocking
molecular rings, and rotaxanes, cyclic molecules threaded by linear molecules.
Researchers are building nanoscale chemical switches and logic gates with these
structures. Eventually, chemical computer memory systems could be built from these
building blocks.

Catenanes and rotaxanes are examples of non-trivial structural tanglings. Our work
is on detecting such interlocking structures in molecules through a combinatorial
method, based on algebraic topology. We model biomolecules as a sequence of alpha
complexes. The basic assumption of this representation is that an alpha-complex
sequence captures the topological features of a molecule. This sequence is also a
filtration of the Delaunay triangulation, a well-studied combinatorial object, enabling the
development of fast algorithms.

The focus of this paper is the linking number. Intuitively, this invariant detects if
components of a complex are linked and cannot be separated. We hope to eventually
incorporate our algorithm into publicly available software as a tool for detecting the
existence of interlocked molecular rings.

Given a filtration, the main contributions of this paper are: 

(i) the extension of the definition of the linking number to graphs, using a canonical 
basis, 

(ii) an algorithm for enumerating and generating all cycles and their spanning surfaces 
within a filtration, 

(iii) data structures for efficient enumeration of co-existing pairs of cycles in different 
components, 

(iv) an algorithm for computing the linking number of a pair of cycles, 

(v) and the implementation of the algorithms and experimentation on real data sets. 

Algorithm (iv) is based on spanning surfaces of cycles, giving us an approximation to
the linking number in the case of non-orientable or self-intersecting surfaces. Such cases
do not arise often in practice, as shown in Section 6. However, we note in Section 7 that
the linking number of a pair may also be computed by alternate algorithms. Regardless
of the approach taken, pairs of potentially linked cycles must first be detected and
enumerated. We provide the algorithms and data structures for such enumeration in (i-iii).

Prior Work. Important knot problems were shown to be decidable by Haken in his
seminal work on normal surfaces. This approach, as reformulated by Jaco and
others, forms the basis of many current knot detection algorithms. Hass et al.
recently showed that these algorithms take exponential time in the number of crossings
in a knot diagram. They also placed both the UNKNOTTING PROBLEM and the
SPLITTING PROBLEM in NP, the latter being the focus of our paper. Generally, other
approaches to knot problems have unknown complexity bounds, and are assumed to
take at least exponential time. As such, the state of the art in knot detection only allows
for very small data sets. We refer to Adams for background in knot theory.

Three-dimensional alpha shapes and complexes may be found in Edelsbrunner and
Mücke. We modify the persistent homology algorithm to compute cycles and
surfaces. We refer to Munkres for background in homology theory that is accessible
to non-specialists.

Outline. The remainder of this paper is organized as follows. We review linking
numbers for collections of closed curves, and extend this notion to graphs in
$\mathbb{R}^3$ in Section 2. We describe our model for molecules in Section 3. Extending
the persistence algorithm, we design basic algorithms in Section 4 and use them to
develop an algorithm for computing linking numbers in Section 5. We show results of
some initial experiments in Section 6, concluding the paper in Section 7.

2 Linking Number 

In this section, we define links and discuss two equivalent definitions of the linking
number. While the first definition provides intuition, the second definition is the basis
of our computational approach.

Links. A knot is an embedding of a circle in three-dimensional Euclidean space,
$k : S^1 \to \mathbb{R}^3$. Two knots are equivalent if there is an ambient isotopy that maps
the first to the second. That is, we may deform the first to the second by a continuous
motion that does not cause self-intersections. A link $l$ is a collection of knots with
disjoint images. A link is separable (splittable) if it can be continuously deformed so that
one or more components can be separated from other components by a plane that itself
does not intersect any of the components. We often visualize a link $l$ by a link diagram,
which is the projection of a link onto a plane such that the over- and under-crossings of
knots are presented clearly. We give an example in Figure 1(a).




(a) A link diagram for the Whitehead link. (b) Crossing label convention.

Fig. 1. The Whitehead link (a) is labeled according to the convention (b) that the
crossing label is +1 if the rotation of the overpass by 90 degrees counter-clockwise aligns
its direction with the underpass, and -1 otherwise.



Linking Number. A knot (link) invariant is a function that assigns equivalent objects
to equivalent knots (links). Seifert first defined an integer link invariant, the linking
number, in 1935 to detect link separability. Given a link diagram for a link $l$, we
choose orientations for each knot in $l$. We then assign integer labels to each crossing
between any pair of knots $k, k'$, following the convention in Figure 1(b). Let the linking
number $\lambda(k, k')$ of the pair of knots be one half the sum of these labels. A standard
argument using Reidemeister moves shows that $\lambda$ is an invariant for equivalent pairs
of knots up to sign. The linking number $\lambda(l)$ of a link $l$ is

$$\lambda(l) = \sum_{k \neq k' \in l} |\lambda(k, k')|.$$

We note that $\lambda(l)$ is independent of knot orientations. Also, the linking number does
not completely recognize linking. The Whitehead link in Figure 1(a), for example, has
linking number zero, but is not separable.
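The crossing-label definition can be illustrated with a minimal sketch. The function
names and the two label lists are assumptions made for illustration: the lists are
hypothetical labelings chosen to match the stated values (1 for the Hopf link, 0 for the
Whitehead link) and are not read off the figures.

```python
from fractions import Fraction

def pair_linking_number(labels):
    """Half the sum of the +1/-1 labels at the crossings between two knots."""
    # Two closed curves cross each other an even number of times in a diagram,
    # so the half-sum is an integer; Fraction makes a violation visible.
    return Fraction(sum(labels), 2)

def link_linking_number(crossings):
    """Sum of |lambda(k, k')| over unordered pairs of distinct knots.

    `crossings` maps a pair of knot names to the list of its crossing labels.
    """
    return sum(abs(pair_linking_number(labels)) for labels in crossings.values())

# Hypothetical labelings (assumptions, not taken from the figures).
hopf = {("a", "b"): [+1, +1]}
whitehead = {("a", "b"): [+1, +1, -1, -1]}

print(link_linking_number(hopf), link_linking_number(whitehead))
```

Taking absolute values per pair is what makes the result independent of the chosen
orientations: reversing one knot flips the sign of every label it participates in.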

Fig. 2. The Hopf link and Seifert surfaces of its two unknots are shown on the left.
Clearly, $\lambda = 1$. This link is the 200th complex for data set H in Section 6. The
spanning surface produced for the cycle on the right is a Möbius strip and therefore
non-orientable.

Surfaces. The linking number may be equivalently defined by other methods, including
one based on surfaces. A spanning surface for a knot $k$ is an embedded surface with
boundary $k$. An orientable spanning surface is a Seifert surface. Because it is orientable,
we may label its two sides as positive and negative. We show examples of spanning
surfaces for the Hopf link and Möbius strip in Figure 2. Given a pair of oriented knots
$k, k'$, and a Seifert surface $s$ for $k$, we label $s$ by using the orientation of $k$. We then
adjust $k'$ via a homotopy $h$ until it meets $s$ in a finite number of points. Following along
$k'$ according to its orientation, we add +1 whenever $k'$ passes from the negative to the
positive side, and -1 whenever $k'$ passes from the positive to the negative side. The
following lemma asserts that this sum is independent of the choice of $h$ and $s$, and that
it is, in fact, the linking number.

Seifert Surface Lemma. $\lambda(k, k')$ is the sum of the signed intersections between
$k'$ and any Seifert surface for $k$.

The proof is by a standard Seifert surface construction. If the spanning surface is
non-orientable, we can still count how many times we pass through the surface, giving
us the following weaker result.

Spanning Surface Lemma. $\lambda(k, k') \pmod 2$ is the parity of the number of times
$k'$ passes through any spanning surface for $k$.

Graphs. We need to extend the linking number to graphs, in order to use the above
lemma for computing linking numbers for simplicial complexes. Let $G = (V, E)$,
$E \subseteq \binom{V}{2}$, be a simple undirected graph in $\mathbb{R}^3$ with $c$ components
$G^1, \ldots, G^c$. Let $z_1, \ldots, z_m$ be a fixed basis for the cycles in $G$, where
$m = |E| - |V| + c$. We then define the linking number between two components of $G$
to be $\lambda(G^i, G^j) = \sum |\lambda(z_p, z_q)|$ for all cycles $z_p, z_q$ in $G^i, G^j$,
respectively. The linking number of $G$ is then defined by combining the total interaction
between pairs of components:

$$\lambda(G) = \sum_{i \neq j} \lambda(G^i, G^j).$$

The linking number is computed only between pairs of components, following Seifert's
original definition. Linked cycles within the same component may be easily unlinked
by a homotopy. Figure 3 shows that the linking number for graphs is dependent on the
chosen basis. While it may seem that we want $\lambda(G) = 1$ in the figure, there is no clear
answer in general. We will define a canonical basis in Section 4, using the persistent
homology algorithm, to compute $\lambda(G)$ for simplicial complexes.

Herbert Edelsbrunner and Afra Zomorodian

(left) $\lambda = 1$ (right) $\lambda = 2$

Fig. 3. We get different $\lambda(G)$ for graph $G$ (top) depending on our choice of basis for
one of its components: two small cycles (left) or one large and one small cycle (right).
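The size of a cycle basis, $m = |E| - |V| + c$, can be checked with a short sketch. The
function name and the example graph are illustrative assumptions; the graph is a
triangle next to a component with two independent cycles, loosely in the spirit of
Figure 3.

```python
def cycle_rank(num_vertices, edges):
    """Return (m, c): cycle rank m = |E| - |V| + c and component count c."""
    parent = list(range(num_vertices))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    components = num_vertices
    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:          # this edge joins two components
            parent[ru] = rv
            components -= 1
    return len(edges) - num_vertices + components, components

triangle = [(0, 1), (1, 2), (2, 0)]
theta = [(3, 4), (3, 5), (5, 4), (3, 6), (6, 4)]   # two independent cycles
print(cycle_rank(7, triangle + theta))
```

Any basis for this graph has three cycles, one in the triangle and two in the other
component, but which two cycles are chosen there is exactly the freedom that makes the
graph linking number basis-dependent.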



3 Alpha Complexes 

Our approach to analyzing a topological space is to assume a filtration for such a space.
A filtration may be viewed as a history of a growing space that is undergoing geometric
and topological changes. While filtrations may be obtained by various methods, only
meaningful filtrations give meaningful linking numbers. As such, we use alpha complex
filtrations to model molecules. The alpha complex captures the connectivity of a
molecule that is represented by a union of spheres. This model may be viewed as the
dual of the space-filling model for molecules.

Dual Complex. A spherical ball $\hat{u} = (u, U^2) \in \mathbb{R}^3 \times \mathbb{R}$ is
defined by its center $u$ and square radius $U^2$. If $U^2 < 0$, the radius is imaginary and
so is the ball. The weighted distance of a point $x$ from a ball $\hat{u}$ is
$\pi_{\hat{u}}(x) = \|x - u\|^2 - U^2$. Note that a point $x \in \mathbb{R}^3$ belongs to the
ball iff $\pi_{\hat{u}}(x) \leq 0$, and it belongs to the bounding sphere iff
$\pi_{\hat{u}}(x) = 0$. Let $S$ be a finite set of balls. The Voronoi region of
$\hat{u} \in S$ is the set of points for which $\hat{u}$ minimizes the weighted distance,

$$V_{\hat{u}} = \{x \in \mathbb{R}^3 \mid \pi_{\hat{u}}(x) \leq \pi_{\hat{v}}(x), \ \forall \hat{v} \in S\}.$$

The Voronoi regions decompose the union of balls into convex cells of the form
$\hat{u} \cap V_{\hat{u}}$, as illustrated in Figure 4. Any two regions are either disjoint or
they overlap along a shared portion of their boundary. We assume general position, where
at most four Voronoi regions can have a non-empty common intersection. Let
$T \subseteq S$ have the property that its Voronoi regions have a non-empty common
intersection, and consider the convex hull of the corresponding centers,
$\sigma_T = \mathrm{conv}\{u \mid \hat{u} \in T\}$. General position implies that $\sigma_T$ is a
$d$-dimensional simplex, where $d = \mathrm{card}\, T - 1$. The dual complex of $S$ is the
collection of simplices constructed in this manner,

$$K = \{\sigma_T \mid T \subseteq S, \ \bigcap_{\hat{u} \in T} (\hat{u} \cap V_{\hat{u}}) \neq \emptyset\}.$$

Fig. 4. Union of nine disks, convex decomposition using Voronoi regions, and dual
complex.

Any two simplices in $K$ are either disjoint or they intersect in a common face which is a
simplex of smaller dimension. Furthermore, if $\sigma \in K$, then all faces of $\sigma$ are
simplices in $K$. A set of simplices with these two properties is a simplicial complex. A
subcomplex is a subset $L \subseteq K$ that is itself a simplicial complex.
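The weighted distance and the Voronoi regions it induces can be sketched as follows. The
representation of a ball as a `(center, square_radius)` pair and the function names are
assumptions made for illustration.

```python
def weighted_distance(x, ball):
    """pi(x) = ||x - u||^2 - U^2 for a ball given as (center u, square radius U^2)."""
    center, square_radius = ball
    return sum((xi - ui) ** 2 for xi, ui in zip(x, center)) - square_radius

def in_ball(x, ball):
    """A point belongs to the ball iff its weighted distance is non-positive."""
    return weighted_distance(x, ball) <= 0

def voronoi_owner(x, balls):
    """Index of a ball whose Voronoi region contains x (ties go to the first)."""
    return min(range(len(balls)), key=lambda i: weighted_distance(x, balls[i]))

balls = [((0, 0, 0), 1), ((3, 0, 0), 4)]
print(voronoi_owner((0.5, 0, 0), balls), voronoi_owner((2, 0, 0), balls))
```

A point lies in the cell $\hat{u} \cap V_{\hat{u}}$ exactly when `in_ball` holds for the
ball returned by `voronoi_owner`, which is how the union of balls is cut into convex
pieces.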

Alpha Complex. A filtration ordering is an ordering of a set of simplices such that
each prefix of the ordering is a subcomplex. The sequence of subcomplexes defined by
taking successively larger prefixes is the corresponding filtration. For dual complexes
of a collection of balls, we generate an ordering and a filtration by literally growing the
balls. For every real number $\alpha^2 \in \mathbb{R}$, we increase the square radius of a ball
$\hat{u}$ by $\alpha^2$, giving us $\hat{u}(\alpha) = (u, U^2 + \alpha^2)$. We denote the
collection of expanded balls $\hat{u}(\alpha)$ as $S(\alpha)$. If $U^2 = 0$, then $\alpha$ is the
radius of $\hat{u}(\alpha)$. If $\alpha^2 < 0$, then $\alpha$ is imaginary, and so is the ball
$\hat{u}(\alpha)$. The $\alpha$-complex $K(\alpha)$ of $S$ is the dual complex of $S(\alpha)$.
For example, $K(-\infty) = \emptyset$, $K(0) = K$, and $K(\infty) = D$ is the dual of the
Voronoi diagram, also known as the Delaunay triangulation of $S$. For each simplex
$\sigma \in D$, there is a unique birth time $\alpha^2(\sigma)$ defined such that
$\sigma \in K(\alpha)$ iff $\alpha^2 \geq \alpha^2(\sigma)$. We order the simplices
such that $\alpha^2(\sigma) < \alpha^2(\tau)$ implies $\sigma$ precedes $\tau$ in the ordering.
More than one simplex may be born at a time and such cases may arise even if $S$ is in
general position. In the case of a tie, it is convenient to order lower-dimensional
simplices before higher-dimensional ones, breaking remaining ties arbitrarily. We call the
resulting sequence the age ordering of the Delaunay triangulation.
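The age ordering amounts to a sort by birth time with dimension as the tie-breaker. The
sketch below assumes birth values are already known and stores each simplex as a
hypothetical `(vertices, birth)` pair; computing the birth values themselves is not
shown.

```python
def age_ordering(simplices):
    """Sort simplices by birth time, lower dimension first on ties.

    Each simplex is (vertices, birth); remaining ties keep their input order,
    since Python's sort is stable.
    """
    return sorted(simplices, key=lambda s: (s[1], len(s[0]) - 1))

def is_filtration_ordering(ordered):
    """Check that every prefix is a subcomplex: faces precede their cofaces."""
    seen = set()
    for vertices, _ in ordered:
        vs = frozenset(vertices)
        for v in vs:                      # every codimension-one face
            face = vs - {v}
            if face and face not in seen:
                return False
        seen.add(vs)
    return True

simplices = [((0, 1), 1.0), ((0,), 0.0), ((1,), 0.0), ((2,), 0.0),
             ((0, 1, 2), 2.0), ((1, 2), 2.0), ((0, 2), 2.0)]
ordered = age_ordering(simplices)
print([s[0] for s in ordered])
```

In the example the triangle is born at the same time as two of its edges, so the
dimension tie-break is what keeps the ordering a valid filtration ordering.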

Modeling Molecules. To model molecules by alpha complexes, we use representations
of molecules as unions of balls. Each ball is an atom, as defined by its position in
space and its van der Waals radius. These atoms become the spherical balls we need
to define our complexes. Our representation gives us a filtration of alpha complexes
for each molecule. We compute a linking number for each complex in a filtration of
$m$ complexes. Let $[m]$ denote the set $\{1, 2, \ldots, m\}$. Then, the linking number may
be viewed as a signature function $\lambda : [m] \to \mathbb{Z}$ that maps each index
$i \in [m]$ to an integer $\lambda(i) \in \mathbb{Z}$.

4 Basis and Surfaces

To compute the linking numbers for an alpha complex, we need to recognize cycles,
establish a basis for the set of cycles, and find spanning surfaces for the basis cycles.
We do so by extending an algorithm we developed for computing persistent homology.
We dispense with defining persistence and concentrate on the algorithm and its
extension.

Homology. We use homology to define cycles in a complex. Homology partitions
cycles into equivalence classes using the boundary class of bounding cycles as the null
element of a quotient group in each dimension. We use $\mathbb{Z}_2$ homology, so the group
operation, which we call addition, is symmetric difference. Addition allows us to
combine sets of simplices in a way that eliminates shared boundaries, as shown in Figure 5.
Intuitively, non-bounding 1-cycles correspond to the graph notion of a cycle. We need to
define a basis for the first homology group of the complex which contains all 1-cycles,
and choose representatives for each homology class. We use these representatives to
compute linking numbers for the complex.

Fig. 5. Symmetric difference in dimensions one and two. We add two 1-cycles to get
a new 1-cycle. We add the surfaces the cycles bound to get a spanning surface for the
new 1-cycle.
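Addition over $\mathbb{Z}_2$ is easy to experiment with when chains are stored as sets. In
this minimal sketch (the names are illustrative assumptions), two triangles sharing an
edge are added: the shared edge cancels in the sum of their boundaries, and the sum of
the two triangles is a spanning surface for the resulting cycle.

```python
def edge(u, v):
    return frozenset((u, v))

def add_mod2(a, b):
    """Z_2 addition of chains stored as sets of simplices (symmetric difference)."""
    return a ^ b

def boundary(triangles):
    """Z_2 boundary of a set of triangles: edges lying on an odd number of them."""
    result = set()
    for t in triangles:
        u, v, w = sorted(t)
        result ^= {edge(u, v), edge(v, w), edge(u, w)}
    return result

t1, t2 = frozenset((0, 1, 2)), frozenset((1, 2, 3))
z1, z2 = boundary({t1}), boundary({t2})
z = add_mod2(z1, z2)              # the shared edge {1, 2} cancels
surface = add_mod2({t1}, {t2})    # spanning surface for the new cycle
print(sorted(sorted(e) for e in z))
```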

A simplex of dimension $d$ in a filtration either creates a $d$-cycle or destroys a
$(d-1)$-cycle by turning it into a boundary. We mark simplices as positive or negative,
according to this action. In particular, edges in a filtration which connect components
are marked as negative. The set of all negative edges gives us a spanning tree of the
complex, as shown in Figure 6. We use this spanning tree to define our canonical basis.

Fig. 6. Solid negative edges combine to form a spanning tree. The dashed positive edge
$\sigma_i$ creates a canonical cycle.



Every time a positive edge $\sigma_i$ is added to the complex, it creates a new cycle. We
choose the unique cycle that contains $\sigma_i$ and no other positive edge as a new basis
cycle. We call this cycle a canonical cycle, and the collection of canonical cycles, the
canonical basis. We use this basis for computation.
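The marking of edges and the resulting canonical cycles can be sketched with a
union-find structure. This is a simplified illustration under stated assumptions: it
processes edges only, in filtration order, and ignores the triangles that later destroy
cycles. It is not the persistence-based procedure developed below, and all names are
hypothetical.

```python
def canonical_cycles(num_vertices, edges):
    """Return (negative_edges, cycles) for edges listed in filtration order.

    A negative edge joins two components and enters the spanning forest; a
    positive edge closes the canonical cycle made of itself plus forest edges.
    """
    parent = list(range(num_vertices))            # union-find forest
    tree = {v: [] for v in range(num_vertices)}   # spanning forest adjacency

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def tree_path(src, dst):
        """Edges on the unique forest path from src to dst."""
        stack, prev = [src], {src: None}
        while stack:
            x = stack.pop()
            if x == dst:
                break
            for y in tree[x]:
                if y not in prev:
                    prev[y] = x
                    stack.append(y)
        path = []
        while prev[dst] is not None:
            path.append(frozenset((dst, prev[dst])))
            dst = prev[dst]
        return path

    negative, cycles = [], []
    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:                      # negative edge: merges components
            parent[ru] = rv
            tree[u].append(v)
            tree[v].append(u)
            negative.append((u, v))
        else:                             # positive edge: creates a cycle
            cycles.append({frozenset((u, v)), *tree_path(u, v)})
    return negative, cycles

negative, cycles = canonical_cycles(4, [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)])
print(negative, [len(c) for c in cycles])
```

Each returned cycle contains exactly one positive edge, the one that created it, which
is the defining property of the canonical basis.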

Persistence. The persistence algorithm matches positive and negative simplices to find
life-times of homological cycles in a filtration. The algorithm does so by following a
representative cycle $z$ for each class. Initially, $z$ is the boundary of a negative simplex
$\sigma_j$, as $z$ must lie in the homology class $\sigma_j$ destroys. The algorithm then
successively adds class-preserving boundary cycles to $z$ until it finds the matching
positive simplex $\sigma_i$, as shown in Figure 7. We call the half-open interval $[i, j)$ the
persistence interval of both the homology class and its canonical representative. During
this interval, the homology class exists as a class of homologous non-bounding cycles in
the filtration. As such, the class may only affect the linking numbers of complexes
$K_i, \ldots, K_{j-1}$ in the filtration. We use this insight in the next section to design an
algorithm for computing linking numbers.

Fig. 7. Starting from the boundary of the negative triangle $\sigma_j$, the persistence
algorithm finds a matching positive edge $\sigma_i$ by finding the dashed 1-cycle. We modify
this 1-cycle further to find the solid canonical 1-cycle and a spanning surface.

Computing Canonical Cycles. The persistence algorithm halts when it finds the
matching positive simplex $\sigma_i$ for a negative simplex $\sigma_j$, often generating a cycle
$z$ with multiple positive edges and multiple components. We need to convert $z$ into a
canonical cycle by eliminating all positive edges in $z$ except for $\sigma_i$. We call this
process canonization. To canonize a cycle, we add cycles associated with unnecessary
positive edges to $z$ successively, until $z$ is composed of $\sigma_i$ and negative edges, as
shown in Figure 7. Canonization amounts to replacing one homology basis element with a
linear combination of other elements in order to reach the unique canonical basis we
defined earlier. A cycle undergoing canonization changes homology classes, but the rank
of the basis never changes.

Computing Spanning Surfaces. For each canonical cycle, we need a spanning surface
in order to compute linking numbers. We may compute these by maintaining surfaces
while computing the cycles. Recall that initially, a cycle representative is the boundary
of a negative simplex $\sigma_j$. We use $\sigma_j$ as the initial spanning surface for $z$. Every
time we add a cycle $y$ to $z$ in the persistence algorithm, we also add the surface $y$
bounds to $z$'s surface. We continue this process through canonization to produce both
canonical cycles and their spanning surfaces. Here, we are using a crucial property of our
filtrations: the final complex is always the Delaunay complex of the set of weighted
points and does not contain any 1-cycles. Therefore, all 1-cycles are eventually turned
to boundaries and have spanning surfaces.

If the generated spanning surface is Seifert, we may apply the SEIFERT SURFACE
Lemma to compute the linking numbers. In some cases, however, the spanning surface
is not Seifert, as in Figure 2 (right). In these cases, we may either compute the linking
number modulo 2 by applying the SPANNING SURFACE Lemma, or compute the linking
number by alternative methods.

5 Algorithm 

In this section, we use the basis and spanning surfaces computed for 1-cycles to find
linking numbers for all complexes in a filtration. Since we focus on 1-cycles only, we
will refer to them simply as cycles.

Overview. We assume a filtration $K_1, K_2, \ldots, K_m$ as input, which we alternately view
as a single complex undergoing growth. As simplices are added, the complex undergoes
topological changes which affect the linking number: new components are created and
merged together, and new non-bounding cycles are created and eventually destroyed.
We use a basic insight from the last section: a basis cycle $z$ with persistence interval
$[i, j)$ may only affect the linking numbers of complexes $K_i, \ldots, K_{j-1}$ in the
filtration. Consequently, we only need to consider basis cycles $z'$ that exist during some
subinterval $[u, v) \subseteq [i, j)$ in a different component than $z$'s. We call the pair
$z, z'$ a potentially-linked (p-linked) pair of basis cycles, and the interval $[u, v)$ the
p-linking interval.

Focusing on p-linked pairs, we get an algorithm with three phases. In the first phase,
we compute all p-linked pairs of cycles. In the second phase, as shown in Figure 8, we
compute the linking numbers of such pairs. In the third and final phase, we aggregate
these contributions to find the linking number signature for the filtration.



for each p-linked pair $z_p, z_q$ with interval $[u, v)$ do
    Compute $\lambda = |\lambda(z_p, z_q)|$;
    Output $(\lambda, [u, v))$
endfor.



Fig. 8. Linking Number Algorithm.
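The third phase, aggregating the outputs of Figure 8 into a signature, can be sketched as
follows. This assumes that a pair with linking number $\lambda$ and p-linking interval
$[u, v)$ simply adds $\lambda$ to every complex $K_i$ with $u \leq i < v$, and that
$1 \leq u < v \leq m + 1$; the function name and the difference-array technique are
illustrative choices, not taken from the paper.

```python
def linking_signature(m, contributions):
    """Aggregate (lam, (u, v)) outputs into a signature over complexes 1..m.

    A difference array keeps the cost linear in m plus the number of pairs.
    """
    diff = [0] * (m + 2)
    for lam, (u, v) in contributions:
        diff[u] += lam        # contribution starts at index u
        diff[v] -= lam        # and stops just before index v
    signature, running = [], 0
    for i in range(1, m + 1):
        running += diff[i]
        signature.append(running)
    return signature

print(linking_signature(6, [(1, (2, 5)), (2, (4, 7))]))
```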



Two cycles $z_p, z_q$ with persistence intervals $[i_p, j_p), [i_q, j_q)$ co-exist during
$[r, s) = [i_p, j_p) \cap [i_q, j_q)$. We need to know if these cycles also belong to
different components during some sub-interval $[u, v) \subseteq [r, s)$. Let $t_{p,q}$ be the
minimum index in the filtration when $z_p$ and $z_q$ are in the same component. Then,
$[u, v) = [r, s) \cap [0, t_{p,q})$. If $[u, v) \neq \emptyset$, $z_p, z_q$ are p-linked during
that interval. In the remainder of this section, we will first develop a data structure for
computing $t_{p,q}$ for any pair of cycles $z_p, z_q$. Then, we use this data structure to
efficiently enumerate all pairs of p-linked cycles. Finally, we give an algorithm for
computing $\lambda(z_p, z_q)$ for a p-linked pair of cycles $z_p, z_q$.
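The p-linking interval is a pair of interval intersections, which the following sketch
makes concrete. The function name and the convention of returning `None` for an empty
interval are assumptions.

```python
def p_linking_interval(interval_p, interval_q, merge_time):
    """Return [u, v) = [i_p, j_p) ∩ [i_q, j_q) ∩ [0, t_pq), or None if empty."""
    (ip, jp), (iq, jq) = interval_p, interval_q
    r, s = max(ip, iq), min(jp, jq)      # co-existence interval [r, s)
    u, v = r, min(s, merge_time)         # clip at the time the components merge
    return (u, v) if u < v else None

print(p_linking_interval((3, 10), (5, 20), 8))
```

In the example the cycles co-exist during [5, 10) but share a component from index 8 on,
so they are p-linked only during [5, 8).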

Component History. To compute $t_{p,q}$, we need to have a history of the changes to the
set of components in a filtration. There are two types of simplices that can change this
set. Vertices create components and are therefore all positive. Negative edges connect
components. We construct a binary tree, called the component tree, recording these
changes using a union-find data structure. The leaves of the component tree are the
vertices of the filtration. When a negative edge connects two components, we create an
internal node and connect it to the nodes representing these components, as shown in
Figure 9. The component tree has size $O(n)$ for $n$ vertices, and we construct it in time
$O(n A^{-1}(n))$, where $A^{-1}(n)$ is the inverse of Ackermann's function, which exhibits
extremely slow growth. Having constructed the component tree, we find the time two
vertices $w, x$ are in the same component by finding their lowest common ancestor (lca) in
this tree. We utilize Harel and Tarjan's optimal method to find lca's with $O(n)$
preprocessing time and $O(1)$ query time. Their method uses bit operations. If such
operations are not allowed, we may use van Leeuwen's method with the same preprocessing
time and $O(\log \log n)$ query time.

Fig. 9. The union-find data structure (left) has vertices as nodes and negative edges as
edges. The component tree (right) has vertices as leaves and negative edges as internal
nodes.
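A minimal sketch of the component tree follows. It assumes negative edges arrive as
hypothetical `(time, u, v)` triples in filtration order, and it answers queries by
walking parent pointers, a naive stand-in for the constant-time lca methods cited above.

```python
def build_component_tree(num_vertices, negative_edges):
    """Build the component tree; returns (parent, time) arrays.

    Leaves 0..n-1 are the vertices; each negative edge adds an internal node
    whose children are the two components it merges.
    """
    parent = list(range(num_vertices))   # tree parent; a root is its own parent
    time = [0] * num_vertices            # merge time stored at internal nodes
    top = list(range(num_vertices))      # union-find root -> current tree node
    uf = list(range(num_vertices))

    def find(x):
        while uf[x] != x:
            uf[x] = uf[uf[x]]
            x = uf[x]
        return x

    for t, u, v in negative_edges:
        ru, rv = find(u), find(v)
        if ru == rv:                     # not a negative edge; skip defensively
            continue
        node = len(parent)
        parent.append(node)
        time.append(t)
        parent[top[ru]] = parent[top[rv]] = node
        uf[ru] = rv
        top[rv] = node
    return parent, time

def merge_time(parent, time, w, x):
    """Time at which w and x first share a component, or None if they never do."""
    ancestors = set()
    while True:
        ancestors.add(w)
        if parent[w] == w:
            break
        w = parent[w]
    while x not in ancestors:
        if parent[x] == x:
            return None
        x = parent[x]
    return time[x]

parent, time = build_component_tree(5, [(5, 0, 1), (7, 2, 3), (9, 1, 2)])
print(merge_time(parent, time, 0, 3))
```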

Enumeration. Having constructed the component tree, we use a modified union-find
data structure to enumerate all pairs of p-linked cycles. We augment the data structure
to allow for quick listing of all existing canonical cycles in each component in $K_i$.
Our augmentation takes two forms: we put the roots of the disjoint trees, representing
components, into a circular doubly-linked list. We also store all existing cycles in each
component in a doubly-linked list at the root node of the component, as shown in
Figure 10. When components merge, the root $x_1$ of one component becomes the parent
of the root $x_2$ of the other component. We concatenate the lists stored at $x_1, x_2$,
store the resulting list at $x_1$, and eliminate $x_2$ from the circular list in $O(1)$ time.
When cycle $z_p$ is created at time $i$, we first find $z_p$'s component in time
$O(A^{-1}(n))$. Then, we store $z_p$ at the root of the component and keep a pointer to
$z_p$ with simplex $\sigma_j$, which destroys $z_p$. This implies that we may delete $z_p$
from the data structure at time $j$ with constant cost.

Fig. 10. The augmented union-find data structure places root nodes in the shaded
circular doubly-linked list. Each root node stores all active canonical cycles in that
component in a doubly-linked list, as shown for the darker component.

Our algorithm to enumerate p-linked cycles is incremental. We add and delete cycles
using the above operations from the union-find forest, as the cycles are created and
deleted in the filtration. When a cycle $z_p$ is created at time $i$, we output all p-linked
pairs in which $z_p$ participates. We start at the root which now stores $z_p$ and walk
around the circular list of roots. At each root $x$, we query the component tree we
constructed in the last subsection to find the time $t$ when the component of $x$ merges
with that of $z_p$. Note that $t = t_{p,q}$ for all cycles $z_q$ stored at $x$. Consequently, we
can compute the p-linking interval for each pair $z_p, z_q$ to determine if the pair is
p-linked. If the filtration contains $P$ p-linked pairs, our algorithm takes time
$O(m A^{-1}(n) + P)$, as there are at most $m$ cycles in the filtration.

Orientation. In the previous section, we showed how one may compute spanning
surfaces $s_p, s_q$ for cycles $z_p, z_q$, respectively. To compute the linking number using
our lemma, we need to orient either the pair $s_p, z_q$ or $z_p, s_q$. Orienting a cycle is
trivial: we orient one edge and walk around to orient the cycle. If either surface has no
self-intersections, we may easily attempt to orient it by choosing an orientation for an
arbitrary triangle on the surface, and spreading that orientation throughout. The
procedure either orients the surface or classifies it as non-orientable. We currently do not
have an algorithm for orienting surfaces with self-intersections. The main difficulty is
distinguishing between two cases for a self-intersection: a surface touching itself and
passing through itself.
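The orientation procedure for a surface without self-intersections can be sketched as a
breadth-first propagation over triangles. The sketch assumes a connected surface given as
vertex triples in which every edge lies on at most two triangles; the function name is
hypothetical. Two neighbouring triangles are consistently oriented when they traverse
their shared edge in opposite directions.

```python
from collections import deque

def orient_surface(triangles):
    """Return consistently oriented triangles, or None if non-orientable."""
    def directed_edges(t, flipped=False):
        a, b, c = t
        edges = [(a, b), (b, c), (c, a)]
        return [(v, u) for u, v in edges] if flipped else edges

    # Map each undirected edge to the triangles that contain it.
    by_edge = {}
    for idx, t in enumerate(triangles):
        for u, v in directed_edges(t):
            by_edge.setdefault(frozenset((u, v)), []).append(idx)

    flipped = {0: False}                 # arbitrary orientation for the seed
    queue = deque([0])
    while queue:
        i = queue.popleft()
        for u, v in directed_edges(triangles[i], flipped[i]):
            for j in by_edge[frozenset((u, v))]:
                if j == i:
                    continue
                # The neighbour must traverse the shared edge as (v, u).
                need = (v, u) not in directed_edges(triangles[j])
                if j not in flipped:
                    flipped[j] = need
                    queue.append(j)
                elif flipped[j] != need:
                    return None          # conflicting requirement
    return [t if not flipped[i] else (t[0], t[2], t[1])
            for i, t in enumerate(triangles)]

mobius = [(0, 1, 2), (1, 2, 3), (2, 3, 4), (3, 4, 0), (4, 0, 1)]
print(orient_surface([(0, 1, 2), (1, 2, 3)]), orient_surface(mobius))
```

The five-triangle example is a standard minimal triangulation of the Möbius strip, so the
propagation runs into a conflict and the surface is classified as non-orientable.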

Computing $\lambda$. We now show how to compute $\lambda(z_p, z_q)$ for a pair of p-linked
cycles $z_p, z_q$, completing the description of our algorithm in Figure 8. We assume that
we have oriented $s_p, z_q$ for the remainder of this subsection.

Let the star of a vertex $u$, $\mathrm{St}\, u$, be the set of simplices containing $u$ as a
vertex. We subdivide the complex via a barycentric subdivision by connecting the centroid
of each triangle to its vertices and the midpoints of its edges, subdividing the simplices
accordingly. This subdivision guarantees that no edge $uv$ will have both ends on a Seifert
surface unless it is entirely contained in that surface. We note that this approach mimics
the construction of regular neighborhoods for complexes.

Fig. 11. On the left, starting at $v$, we walk on $z_q$ according to its orientation. Segments
of $z_q$ that intersect $s_p$ are shown, along with their contribution to $S_{p,q}$. We get
$\lambda(z_p, z_q) = -1$. On the right, the bold flip curve is the border of $s_p^+$ and
$s_p^-$, the portions of $s_p$ that are oriented differently. Counting all +'s in $S_{p,q}$,
we get $\lambda(z_p, z_q) \bmod 2 = 3 \bmod 2 = 1$.

For a vertex $u \in s_p$, the edge property guaranteed by subdivision enables us to mark
each edge $uv \in \mathrm{St}\, u$, $v \notin s_p$, as positive or negative, depending on the
location of $v$ with respect to $s_p$. After marking edges, we walk once around $z_q$,
starting at a vertex not on $s_p$. If such a vertex does not exist, then
$\lambda(z_p, z_q) = 0$. Otherwise, we create a string $S_{p,q}$ of + and - characters by
noting the marking of edges during our walk. $S_{p,q}$ has even length, as we start and end
our walk on a vertex not on $s_p$, and each intersection of $z_q$ with $s_p$ produces a pair
of characters, as shown in Figure 11 on the left. If $S_{p,q}$ is the empty string, $z_q$
never intersects $s_p$ and $\lambda(z_p, z_q) = 0$. Otherwise, $z_q$ passes through $s_p$ for
pairs +- and -+, corresponding to $z_q$ piercing the positive or negative side of $s_p$,
respectively. Scanning $S_{p,q}$ from left to right in pairs, we add +1 for each occurrence
of -+, -1 for each +-, and 0 for each ++ or --. Applying the SEIFERT SURFACE
Lemma in Section 2, we see that this sum is $\lambda(z_p, z_q)$.
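The pair scan is short enough to state directly. The sketch below assumes the string
$S_{p,q}$ has already been produced by the walk around $z_q$; the function name and the
example strings are made up for illustration and are not the strings of Figure 11.

```python
def linking_number_from_string(s):
    """Scan the string in pairs: '-+' adds +1, '+-' adds -1, '++' and '--' add 0."""
    if len(s) % 2 != 0:
        raise ValueError("the crossing string must have even length")
    score = {"-+": 1, "+-": -1, "++": 0, "--": 0}
    return sum(score[s[i:i + 2]] for i in range(0, len(s), 2))

print(linking_number_from_string("+-++--+-"))
```

Pairs ++ and -- correspond to segments that touch the surface and return to the side
they came from, so they leave the sum unchanged.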

Computing $\lambda$ mod 2. If neither of the spanning surfaces $s_p, s_q$ of the two cycles
$z_p, z_q$ is Seifert, we may still compute $\lambda(z_p, z_q) \bmod 2$ by a modified
algorithm, provided one surface, say $s_p$, has no self-intersections. We choose an
orientation on $s_p$ locally, and extend it until all the stars of the original vertices are
oriented. This orientation will not be consistent globally, resulting in pairs of adjacent
vertices in $s_p$ with opposite orientations. We call the implicit boundary between vertices
with opposite orientations a flip curve, as shown in bold in Figure 11 on the right. When a
cycle segment crosses the flip curve, orientation changes. Therefore, in addition to noting
marked edges, we add a + to the string $S_{p,q}$ every time we cross a flip curve. To
compute $\lambda(z_p, z_q) \bmod 2$, we only count +'s in $S_{p,q}$ and take the parity as
our answer.

If $s_p$ is orientable, there are no flip curves on it. The contribution of cycle segments
to the string is the same as before: +- or -+ for segments that pass through $s_p$, and
++ and -- for segments that do not. By counting +'s, only segments that pass through
$s_p$ change the parity of the sum for $\lambda$. Therefore, the algorithm computes
$\lambda \bmod 2$ correctly for orientable surfaces. For the orientable surface on the left
in Figure 11, for instance, we get $\lambda(z_p, z_q) \bmod 2 = 5 \bmod 2 = 1$, which is
equivalent to the parity of the answer computed by the previous algorithm.
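The parity count, and its agreement with the signed scan on orientable surfaces, can be
checked with a small sketch. Both function names are assumptions, and the signed scan is
repeated here only so that the block is self-contained.

```python
from itertools import product

def parity_from_string(s):
    """lambda mod 2: the parity of the number of '+' characters in the string."""
    return s.count("+") % 2

def signed_sum(s):
    """Signed pair scan for an oriented surface: '-+' is +1, '+-' is -1."""
    score = {"-+": 1, "+-": -1, "++": 0, "--": 0}
    return sum(score[s[i:i + 2]] for i in range(0, len(s), 2))

# On an orientable surface every string is a sequence of pairs, and the two
# computations agree modulo 2 for every such string of up to three pairs.
agree = all(
    parity_from_string("".join(pairs)) == signed_sum("".join(pairs)) % 2
    for n in range(4)
    for pairs in product(["++", "--", "+-", "-+"], repeat=n)
)
print(agree)
```

Each piercing pair contributes exactly one +, while ++ and -- contribute an even number,
which is why counting +'s recovers the parity.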

Remark. We are currently examining the question of orienting surfaces with
self-intersections. Using our current methods, we may obtain a lower bound signature for
$\lambda$ by computing a mixed sum: we compute $\lambda$ and $\lambda \bmod 2$ whenever we can
to obtain the approximation. We may also develop other methods, including those based on
the projection definition of the linking number in Section 2.

6 Experiments 

In this section, we present some experimental timing results and statistics which we 
used to guide our algorithm development. We also provide visualizations of basis cycles 
in a filtration. All timings were done on a Micron PC with a 266 MHz Pentium II 
processor and 128 MB RAM running Solaris 8. 

Implementation. We have implemented all the algorithms in the paper, except for the
algorithm for computing $\lambda \bmod 2$. Our implementation differs from our exposition in
three ways. The implemented component tree is a standard union-find data structure
with the union by rank heuristic, but no path compression [il- Although this structure
has an O(n log n) construction time and an O(log n) query time, it is simple to implement
and extremely fast in practice. We also use a heuristic to reduce the number of p-linked
cycles. We store bounding boxes at the roots of the augmented union-find data structure.
Before enumerating p-linked cycles, we check to see if the bounding box of the new
cycle intersects with that of the stored cycles. If not, the cycles cannot be linked, so we
skip their enumeration. Finally, we only simulate the barycentric subdivision.
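
The two implementation choices above can be illustrated with a minimal sketch. It is not the authors' code: the class name `ComponentForest` and the helper functions are hypothetical, and for simplicity the bounding box stored at each root covers the points of a whole component rather than individual cycles. It shows union by rank without path compression, plus the bounding-box overlap test used as a cheap filter.

```python
def merge_boxes(a, b):
    """Smallest axis-aligned box containing boxes a and b (each is (lo, hi))."""
    (alo, ahi), (blo, bhi) = a, b
    lo = tuple(min(x, y) for x, y in zip(alo, blo))
    hi = tuple(max(x, y) for x, y in zip(ahi, bhi))
    return (lo, hi)


def boxes_intersect(a, b):
    """True if the two axis-aligned boxes overlap in every coordinate."""
    (alo, ahi), (blo, bhi) = a, b
    return all(l1 <= h2 and l2 <= h1
               for l1, h1, l2, h2 in zip(alo, ahi, blo, bhi))


class ComponentForest:
    """Union-find with union by rank and no path compression (hypothetical name)."""

    def __init__(self, points):
        self.parent = list(range(len(points)))
        self.rank = [0] * len(points)
        # Each root stores the bounding box of its component.
        self.box = [(tuple(p), tuple(p)) for p in points]

    def find(self, v):
        # Plain walk to the root: O(log n) because union by rank bounds the height.
        while self.parent[v] != v:
            v = self.parent[v]
        return v

    def union(self, u, v):
        ru, rv = self.find(u), self.find(v)
        if ru == rv:
            return ru
        if self.rank[ru] < self.rank[rv]:
            ru, rv = rv, ru
        self.parent[rv] = ru
        if self.rank[ru] == self.rank[rv]:
            self.rank[ru] += 1
        self.box[ru] = merge_boxes(self.box[ru], self.box[rv])
        return ru
```

Two components whose boxes fail `boxes_intersect` cannot contain linked cycles, so a candidate pair can be discarded before any more expensive test is run.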

Data. We have experimented with a variety of data sets and show the results for six rep- 
resentative sets in this section. The first data set contains points regularly sampled along 
two linked circles. The resulting filtration contains a complex which is a Hopf link, as 
shown in Figure |2l The other data sets represent molecular structures with weighted 
points. In each case, we first compute the weighted Delaunay triangulation and the age 
ordering of that triangulation. The data points become vertices or 0-simplices. Table 1
gives the sizes of the data sets, their Delaunay triangulations, and age orderings. We
show renderings of specific complexes in the filtration for data set K in Figure ^3 

Basis. Table 2 summarizes the basis generation process. We distinguish the two steps
of our algorithm: initial basis generation and canonization. We give the number of basis
cycles for the entire filtration, which is equal to the number of positive edges. We also
show the effect of canonization on the size of the cycles and their spanning surfaces
in Table 2. Note that canonization increases the size of cycles by one or two orders of
magnitude. This is partially the reason we try to avoid performing the link detection if
possible.
Table 1. H defines a Hopf link. G is Gramicidin A, a small protein. M is a protein
monomer. Z is a portion of a periodic zeolite structure. K is a human cyclin-dependent
kinase. D is a DNA tile.

          # simplices of dimension d
          0        1         2        3      total
H       100    1,752     3,240    1,587      6,679
G       318    2,322     3,978    1,973      8,591
M     1,001    7,537    13,018    6,481     28,037
Z     1,296   11,401    20,098    9,992     42,787
K     2,370   17,976    31,135   15,528     67,009
D     7,774   60,675   105,710   52,808    226,967




Fig. 12. Complex ATsies of K has two components and seventeen cycles. The spanning 
surfaces are rendered transparently. 



Links. In Table 3 we show that our component tree and augmented trees are very fast
in practice at generating p-linked pairs. We also show that our bounding box heuristic
for reducing the number of p-linked pairs increases the computation time negligibly.
The heuristic is quite successful, moreover, in reducing the number of pairs we have to
check for linkage, eliminating 99.8% of the candidates for dataset Z. The differences
in total time of computation reflect the basic structure of the datasets. Dataset D has
a large computation time, for instance, as the average size of the p-linked surfaces is
approximately 264.16 triangles, compared to about 1.88 triangles for dataset K, and
about 1.73 triangles for dataset M.

Discussion. Our initial experiments demonstrate the feasibility of the algorithms for
fast computation of linking. The experiments fail to detect any links in the protein data,
however. This is to be expected, as a protein consists of a single component, the primary
structure of a protein being a single polypeptide chain of amino acids. Links, on
the other hand, exist in different components by definition. We may relax this definition
easily, however, to allow for links occurring in the same component. We have
implementations of algorithms corresponding to this relaxed definition. Our future plans
include looking for links in proteins from the Protein Data Bank [im. Such links could
occur naturally as a result of disulphide bonds between different residues in a protein.

Table 2. On the left, we give the time to generate and canonize basis cycles, as well as
their number. On the right, we give the average length of cycles and size of surfaces,
before and after canonization.

        time in seconds
      generate  canonize   total    # cycles
H       0.08      0.04      0.12      1,653
G       0.08      0.03      0.11      2,005
M       0.28      0.20      0.48      6,537
Z       0.46      0.46      0.92     10,106
K       0.72      1.01      1.73     15,607
D       2.63      2.94      5.57     52,902

      avg cycle length    avg surface size
       before    after     before    after
H       3.06     51.03      1.06     63.04
G       3.26     13.02      1.38     52.28
M       3.29     34.18      1.33     71.18
Z       4.71     25.33      3.26    117.81
K       3.48     67.87      1.62    166.70
D       3.46     39.94      1.81    158.99

Table 3. Time to construct the component tree, and the computation time and number of
p-linked pairs (alg), p-linked pairs with intersecting bounding boxes (heur), and links.

               time in seconds              # pairs
      tree     alg    heur   links       alg      heur   links
H     0.01    0.00    0.00    0.01         1         1       1
G     0.00    0.01    0.02    0.02       112         0       0
M     0.03    0.06    0.06    0.23    16,503    14,968       0
Z     0.04    0.07    0.07    0.13   169,594       308       0
K     0.06    0.13    0.16    0.36    12,454    11,365       0
D     0.27    0.56    0.82    8.22    98,522     4,448       0



7 Conclusion 

In this paper, we develop algorithms for finding the linking numbers of a filtration.
We give algorithms for computing bases of 1-cycles and their spanning surfaces in
simplicial complexes, and enumerating co-existing cycles in different components. In
addition, we present an algorithm for computing the linking number of a pair of cycles
using the surface formulation. Our implementations show that the algorithms are fast
and feasible in practice. By modeling molecules as filtrations of alpha complexes, we
can detect potential non-trivial tangling within molecules. Our work is within a framework
for applying topological methods for understanding molecular structures.
Abstract. An important aspect of homology modeling and protein design algorithms
is the correct positioning of protein side chains on a fixed backbone.
Homology modeling methods are necessary to complement large scale structural
genomics projects. Recently it has been shown that in automatic protein design
it is of the utmost importance to find the global solution to the side chain
positioning problem Q. If a suboptimal solution is found, the difference in free
energy between different sequences will be smaller than the error of the side
chain positioning. Several different algorithms have been developed to solve this
problem. The most successful methods use a discrete representation of the
conformational space. Today, the best methods to solve this problem are based on
the dead end elimination theorem. Here we introduce an alternative method. The
problem is formulated as a linear integer program. This programming problem
can then be solved by efficient polynomial time methods, using linear programming
relaxation. If the solution to the relaxed problem is integral it corresponds to
the global minimum energy conformation (GMEC). In our experimental results,
the solution to the relaxed problem has always been integral.



1 Introduction 

Within the near future the approximate fold of most proteins will be known, thanks to
structural genomics projects. The approximate fold of a protein is not enough, though,
to obtain a full understanding of a molecular mechanism or to be able to utilize the
structure in drug design. For this a complete model of the protein is often needed. The
main procedure today to obtain a complete model is “homology modeling”. The
process of homology modeling often includes the positioning of amino acid side chains
on a fixed backbone of a protein. Another area that has recently become important is
the area of automatic “protein design”. Here the goal is to obtain a sequence that folds
to a given structure. Mayo and coworkers have shown that it is possible to perform
automatic designs Id- One crucial step in their procedure is to find the optimal side
chain conformation.



O. Gascuel and B.M.E. Moret (Eds.): WABI 2001, LNCS 2149, pp. 128- 1^71 2001. 
There are two common features to most algorithms that try to solve this problem.
The first one is to discretize the allowed conformational space into “rotamers”
representing the statistically dominant side chain orientations in naturally occurring
proteins. The rotamer approximation reduces the conformation space and makes it
possible to use a discrete formulation of the problem. The second feature is the use of
an energy function that can be divided into terms depending only on pairwise interactions
between different parts of the protein. The total energy of a protein in a specific
conformation, $E_C$, can therefore be described as

\[ E_C = E_{backbone} + \sum_i E(i_r) + \sum_i \sum_{j>i} E(i_r j_s) \qquad (1) \]

$E_{backbone}$ is the self-energy of the backbone, i.e. the interaction between all atoms in
the backbone. $E(i_r)$ is the self-energy of side chain i in its rotamer conformation $i_r$,
including its interaction with the backbone. $E(i_r j_s)$ is the interaction energy between
side chain i in the rotamer conformation $i_r$ and the side chain j in the rotamer
conformation $j_s$. In this study we will keep the backbone of the protein fixed and only change
the side chain rotamers. The term $E_{backbone}$ will therefore not contribute to any difference
in energy between two protein conformations and can be ignored. The problem
we want to solve can thus be defined as follows: given the coordinates of the backbone and a
specific rotamer library and energy function, find the set of rotamers that minimizes the
energy function. The solution space of this problem increases exponentially
with the number of residues included.
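
As a small illustration of the problem statement, the following sketch evaluates equation (1) without the constant backbone term and finds the minimum by exhaustive enumeration. The data layout (`self_e[i][r]`, `pair_e[(i, j)][r][s]` with i < j), the function names and the toy energies are all assumptions made for this example; they are not taken from the paper.

```python
from itertools import product


def total_energy(assign, self_e, pair_e):
    """Energy of one rotamer assignment: self terms plus all pair terms (i < j)."""
    n = len(assign)
    energy = sum(self_e[i][assign[i]] for i in range(n))
    for i in range(n):
        for j in range(i + 1, n):
            energy += pair_e[(i, j)][assign[i]][assign[j]]
    return energy


def exhaustive_gmec(self_e, pair_e):
    """Try every rotamer combination; the cost is the product of the rotamer counts."""
    choices = [range(len(rotamers)) for rotamers in self_e]
    return min(product(*choices),
               key=lambda a: total_energy(a, self_e, pair_e))


# Toy problem: three side chains with two rotamers each (made-up energies).
TOY_SELF = [[0.0, 1.0], [0.5, 0.0], [0.0, 0.2]]
TOY_PAIR = {
    (0, 1): [[2.0, 0.0], [0.0, 3.0]],
    (0, 2): [[0.0, 1.0], [1.0, 0.0]],
    (1, 2): [[0.0, 0.5], [0.5, 0.0]],
}

best = exhaustive_gmec(TOY_SELF, TOY_PAIR)
```

With n side chains and k rotamers each, the loop visits k^n assignments, which is the exponential growth mentioned above.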




Fig. 1. Schematic Drawing of Different Rotamers in a Fantasy Protein. 



1.1 Methods Used Today 

There are several types of solution methods to the side chain positioning problem. Here 
we briefly review three of them. 
Stochastic Algorithms. Several groups have developed stochastic algorithms, such as
Monte Carlo simulations ['31 and Genetic Algorithms o to solve the side chain
positioning problem. The rotamer approximation together with the energy function makes it
possible to view the problem as a discrete energy landscape where each point represents 
a specific rotamer combination and an assigned energy. To find a global minimum of 
the energy landscape these algorithms sample solutions semi-randomly and then move 
from one possible solution to another in a manner that depends both on the nature of 
the energy landscape and on specific rules for movement. With this approach you only 
need to compute the energy for the sampled conformations. Another advantage of these 
algorithms is that you can allow the structure of the protein to vary continuously if you 
want. One limitation, on the other hand, is that you are never guaranteed to have found 
the global minimum. 



Pruning Algorithms. The most frequently used algorithms in this category are based
on the dead end elimination theorem (DEE). DEE-based methods iteratively
use rejection criteria. If a criterion is fulfilled, it guarantees that a certain rotamer or
combination of rotamers cannot be a part of the GMEC. This reduces the conformation
space significantly, and ideally a single solution remains at the end. If DEE-based
methods converge to a single solution they are guaranteed to have found the global
minimum energy. During the last few years methods based on this theorem have developed
considerably and can now solve most side-chain positioning problems [ClHI. However,
in protein design there is a limit beyond which this method fails to converge and other
inexact methods have to be used m- The reason for this is that protein design requires
a large rotamer library [0.



Mean-Field Algorithms. A third approach is to use mean field algorithms 1 11 ll . Here
all rotamers of all side chains can be thought of as existing at the same time, but with
different probabilities. Each residue is considered in turn and the probabilities for all
rotamers of that side chain are updated, based on the mean field generated by the multiple
side chains at neighbouring residues. The procedure is repeated until it converges. The
predicted conformation of a side chain is chosen to be the rotamer with the highest
probability. An advantage of self-consistent mean field algorithms is that the computational
time scales linearly with the number of residues. Unfortunately, there is no guarantee
that the minimum of the mean-field landscape corresponds to the true GMEC.



2 The Side Chain Positioning Problem Formulated as an Integer
Program

The side-chain positioning problem can be formulated as a classical mathematical
programming problem. To this end it is necessary to introduce some notation and
background. Mathematical programming concerns maximizing (or minimizing) a function
of decision variables, which we denote by x, in the presence of constraints, which
restrict the variables x to belong to a set F of admissible values, the feasible set. The
function to optimize, the objective function, is denoted f(x). A maximization problem
can easily be turned into a minimization problem by changing the sign of f(x). A
general mathematical programming problem can then be stated as

\[ \text{Minimize } f(x) \quad \text{subject to } g_i(x) \le 0, \quad i = 1, \ldots, m \]

If the decision variables can only take integer values, we have an integer
programming problem. The easiest programming problems have continuous variables and
linear constraints and objective functions. These are called linear programming (LP)
problems. Please refer to 111 211 311 for a thorough introduction to linear programming
and integer programming.

It is possible 0, to rewrite the energy function (1) so that it only contains terms
depending on pairs of side chains, i.e., the energy of a conformation can be written as

\[ E_C = \sum_i \sum_{j>i} E'(i_r, j_s). \qquad (2) \]
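
One way such a rewriting can be done is sketched below. This is an assumption for illustration, not necessarily the construction used in the cited work: every side chain takes part in n - 1 pairs, so giving each pair term a share 1/(n - 1) of both self-energies leaves the total unchanged. The function names and toy energies are hypothetical.

```python
from itertools import product


def direct_energy(assign, self_e, pair_e):
    """Self terms plus pair terms, as in equation (1) without the backbone term."""
    energy = sum(self_e[i][r] for i, r in enumerate(assign))
    energy += sum(t[assign[i]][assign[j]] for (i, j), t in pair_e.items())
    return energy


def fold_self_energies(self_e, pair_e):
    """Build E'(i_r, j_s) = E(i_r j_s) + (E(i_r) + E(j_s)) / (n - 1).

    Assumes n >= 2 and that pair_e has an entry for every pair i < j.
    """
    n = len(self_e)
    return {
        (i, j): [[t[r][s] + (self_e[i][r] + self_e[j][s]) / (n - 1)
                  for s in range(len(self_e[j]))]
                 for r in range(len(self_e[i]))]
        for (i, j), t in pair_e.items()
    }


def pairwise_energy(assign, folded):
    """Purely pairwise form corresponding to equation (2)."""
    return sum(t[assign[i]][assign[j]] for (i, j), t in folded.items())


# Made-up energies for three side chains with two rotamers each.
SELF_E = [[0.0, 1.0], [0.5, 0.0], [0.0, 0.2]]
PAIR_E = {
    (0, 1): [[2.0, 0.0], [0.0, 3.0]],
    (0, 2): [[0.0, 1.0], [1.0, 0.0]],
    (1, 2): [[0.0, 0.5], [0.5, 0.0]],
}
FOLDED = fold_self_energies(SELF_E, PAIR_E)
```

For every assignment the two energy functions agree up to floating-point rounding, which is what makes a purely pairwise objective sufficient.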

This benefits our problem formulation. Let the decision variables be

\[ x_{i_r j_s} = \begin{cases} 1 & \text{if side chain } i \text{ is in rotamer } r \text{ and side chain } j \text{ is in rotamer } s \\ 0 & \text{else} \end{cases} \qquad (3) \]

where $i = 1, \ldots, n_{sc}$, $j = 2, \ldots, n_{sc}$, $i < j$ and $n_{sc}$ is the number of side chains. The
total energy of the system (not just one conformation as before) can then be calculated
as a sum, consisting of all possible variables, one for each rotamer combination
$i_r, j_s$, times their respective energy contribution to the total energy, $e_{i_r j_s} = E'(i_r, j_s)$,
see equation (2). The energy is calculated assuming both of these rotamers are
included in the conformation. The total energy, $E_{tot}$, is then:

\[ E_{tot} = \sum_i \sum_r \sum_{j>i} \sum_s e_{i_r j_s} x_{i_r j_s} \qquad (4) \]

where

\[ \sum_r \sum_s x_{i_r j_s} = 1 \quad \text{for all } i \text{ and } j,\; i < j \qquad (5) \]

\[ \sum_q x_{g_q i_r} = \sum_p x_{h_p i_r} = \sum_s x_{i_r j_s} = \sum_t x_{i_r k_t} \qquad (6) \]

for all g, h, i, j, k and r, with g, h < i < j, k, and

\[ x_{i_r j_s} \in \{0, 1\} \qquad (7) \]

The first condition together with (7) ensures that a certain pair of side chains, i
and j, can only exist in one rotamer state, $i_r$ and $j_s$. If you consider a certain pair of side
chains, say i and j, and think of all possible combinations of rotamer states that these
two side chains could possibly take, then only one of these combinations can exist. This
could for example be $i_r$ and $j_s$. The second condition states that one side chain can
only exist in one rotamer state, independent of the rotamer states of other side chains.
This ensures that it cannot exist in one state in relation to one side chain and another
state in relation to another side chain.
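
The meaning of the two conditions can be checked on a small case. The sketch below is illustrative only: the dictionary encoding of the pair variables and the function names are assumptions, and the consistency condition is tested in the form described in the text (each rotamer of a side chain must have the same marginal sum with respect to every other side chain).

```python
from itertools import combinations


def encode(assign, n_rot):
    """Pair variables x[(i, r, j, s)] for i < j, set to 1 iff i uses r and j uses s."""
    n = len(assign)
    return {(i, r, j, s): int(assign[i] == r and assign[j] == s)
            for i, j in combinations(range(n), 2)
            for r in range(n_rot[i])
            for s in range(n_rot[j])}


def satisfies_constraints(x, n_rot):
    n = len(n_rot)
    # Integrality: every variable is 0 or 1.
    if any(v not in (0, 1) for v in x.values()):
        return False
    # First condition: exactly one rotamer combination per pair of side chains.
    for i, j in combinations(range(n), 2):
        total = sum(x[(i, r, j, s)]
                    for r in range(n_rot[i]) for s in range(n_rot[j]))
        if total != 1:
            return False
    # Second condition: rotamer r of side chain i has the same marginal
    # sum with respect to every other side chain k.
    for i in range(n):
        for r in range(n_rot[i]):
            marginals = set()
            for k in range(n):
                if k == i:
                    continue
                if i < k:
                    m = sum(x[(i, r, k, s)] for s in range(n_rot[k]))
                else:
                    m = sum(x[(k, s, i, r)] for s in range(n_rot[k]))
                marginals.add(m)
            if len(marginals) > 1:
                return False
    return True
```

For four side chains with 3, 2, 3 and 2 rotamers this encoding has 37 variables. A vector that puts side chain 0 in one rotamer relative to side chain 1 and in another relative to side chain 2 passes the first condition but fails the second.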

This allows us to formulate a minimization problem as:

\[ \text{Minimize: } E_{tot} \qquad (8) \]

\[ \text{subject to: (5), (6), (7)} \]

This is a linear integer programming problem. We can rewrite it in matrix form:

\[ z_{IP} = \min\{cx \mid Ax = b,\; x \in \{0, 1\}^n\} \qquad (9) \]

where A is an m x n matrix and b and c are vectors of appropriate dimensions. In our case
all components of A are 0, 1 or -1, the components of b are 0 or 1, and c consists of real
values. The condition (6) can of course be implemented in different ways to obtain this
form. We refer to the Appendix for a description of our approach. A small example in
Table 1 shows what the integer program can look like for a small side chain positioning
problem and our implementation.

In general, integer problems are hard to solve. By relaxing the integral constraints,
that is, allowing $0 \le x_{i_r j_s} \le 1$ instead of $x_{i_r j_s} \in \{0, 1\}$, we can turn this problem
into an LP-problem for which there exist several efficient algorithms. If the solution to
the LP relaxed problem is integral, $x_{i_r j_s} \in \{0, 1\}$, then it is an optimal solution to
the original problem Id. This means that if we solve the relaxation of the side chain
positioning problem (8) and the solution turns out to be integral, it corresponds to the
GMEC. The linear programming relaxation is as follows:

\[ z_{LP} = \min\{cx \mid Ax = b,\; 0 \le x \le 1\} \qquad (10) \]

The feasible solution set of the problem, F(LP), defined by Ax = b and $0 \le x \le 1$,
forms a convex polyhedron, as it is an intersection of a finite number of hyperplanes
and half-spaces. The LP-problem is well defined in that it is either infeasible, unbounded,
or has an optimal solution. An optimal solution to an LP problem can always be found in an
extreme point of the polyhedron (if there exists an optimal solution). For a more thorough
introduction to the ideas of the linear programming methods we refer to El-

It can be shown that there is a polynomial time method to solve LP; the proof is
normally based on the ellipsoid method [14]. This method is, however, not appreciated in
practice due to its slowness. Today, two popular methods are used. The idea of the
simplex method is to move from one extreme point to another in such a way that the value
of the objective function will always decrease or at least not increase. This is done until
a minimum is reached or until the problem is found to be unbounded. Although there is
an example showing that the simplex algorithm is of exponential time, it is very efficient
if implemented well and in reality it is often of polynomial time [O. The interior point
methods are a family of algorithms that stay in the strict interior of the feasible region,
for example by using a barrier function. The term grew from Karmarkar’s algorithm to solve
an LP-problem m. Except for certain extreme worst cases, the interior point method
runs in polynomial time.
Table 1. The A, b and c matrices, see (9), of a trivial example protein with four residues
a, b, c and d, where a has three rotamers $a_1$, $a_2$, $a_3$, b two, c three and d two rotamers.
These matrices are used as the input to the simplex algorithm. The order of the
decision variables x is also shown. In these matrices the row (ab=1) corresponds to
$\sum_r \sum_s x_{a_r b_s} = 1$, (a1b=a1c) corresponds to $\sum_s x_{a_1 b_s} = \sum_t x_{a_1 c_t}$ and (a1c=a1d)
corresponds to $\sum_t x_{a_1 c_t} = \sum_u x_{a_1 d_u}$. Thus the rows (a1b=a1c) and (a1c=a1d) together
say that rotamer $a_1$ either exists or not and cannot do both at the same time. The rest
of the rows correspond to similar constraints.

x = ($x_{a_1 b_1}$, $x_{a_1 b_2}$, $x_{a_1 c_1}$, $x_{a_1 c_2}$, $x_{a_1 c_3}$, $x_{a_1 d_1}$, $x_{a_1 d_2}$, $x_{a_2 b_1}$, ..., $x_{a_2 d_2}$, $x_{a_3 b_1}$, ..., $x_{a_3 d_2}$,
$x_{b_1 c_1}$, ..., $x_{b_2 d_2}$, $x_{c_1 d_1}$, $x_{c_1 d_2}$, $x_{c_2 d_1}$, $x_{c_2 d_2}$, $x_{c_3 d_1}$, $x_{c_3 d_2}$)

c = (1.43 ... -0.54)



To use LP-methods alone when solving an integer problem we have to prove that
the solution we find is always integral. We have not been able to do so, but our results
indicate that this could be the case. If the solution is fractional, branch and bound
methods can be used in combination with LP to get an integer solution.



Time Complexity. Let $n_{sc}$ be the number of side chains and $n_{rot}$ be the average
number of rotamer states for each side chain. Then the integer programming formulation of
the problem (8) will give approximately $n_{sc}^2 n_{rot}$ constraints. For the simplex
algorithm it is usually assumed that the number of iterations grows linearly with
the number of constraints of the problem and the work for each iteration grows as the
square of the number of constraints im. That is, the computational time of solving the
relaxed problem by the simplex method should grow as the cube of the number of
constraints, i.e. be in the order of $O(n_{sc}^6)$ in the number of side chains. Looking
at the average number of rotamers for each residue, the time complexity for the simplex
algorithm ought to be $O(n_{rot}^3)$. This is under the assumption that the solution to the
relaxed problem is integral and therefore the final solution.

3 The Dead End Elimination Theorem 

Dead end elimination methods seek to systematically eliminate bad rotamers and
combinations of rotamers until a single solution remains. This is done by the iterative use of
different elimination criteria, which make it possible to exclude rotamers from being
a part of the GMEC.

It is necessary for the DEE methods that the energy description is pairwise, as
described in equation (1). Let us consider one residue, say residue i, and let {C} be the set
of all possible conformations of the rest of the protein, that is, all possible combinations
of rotamers of all side chains except side chain i. Now consider two rotamers of residue
i, namely $i_r$ and $i_t$. If every protein conformation with $i_r$ is higher in energy than the
corresponding conformation with $i_t$ for all possible configurations of non-i residues,
then

\[ E(i_r) + \sum_{j \ne i} E(i_r j_s) > E(i_t) + \sum_{j \ne i} E(i_t j_s) \quad \text{for all } C \qquad (11) \]

and the rotamer $i_r$ cannot be a member of the GMEC. The inequality (11) is not easy
to check computationally since {C} can be huge. Instead one can get a more practical,
but weaker, condition by just comparing the case where the left side of (11) is lowest
in energy with the case where the right side is highest in energy. This can be done by the
following inequality, known as the DEE-theorem, introduced by Desmet et al. [El

\[ E(i_r) + \sum_j \min_s E(i_r j_s) > E(i_t) + \sum_j \max_s E(i_t j_s), \quad i \ne j \qquad (12) \]

The theorem states that the rotamer $i_r$ is dead-ending, and thus cannot be a
member of the GMEC, if the inequality (12) holds true for any rotamer $i_t \ne i_r$ of
the side chain i. For a proof of this theorem we refer to [El. In words, inequality (12)
says that rotamer $i_r$ must be dead-ending if the energy of the system taken at $i_r$'s
most favorable interactions with all the other residues is larger than the energy of
another rotamer ($i_t$) at its worst interactions. The best and worst interactions are given
by “choosing” the rotamer which gives the lowest, respectively the highest, value for
each interaction term. Desmet also described how the criteria could be extended to the
elimination of pairs of rotamers inconsistent with the GMEC.

A more powerful version of these criteria was introduced by Goldstein [Q. It
subtracts the right-hand side from the left in equation (12) before applying the min operator.

\[ E(i_r) - E(i_t) + \sum_{j \ne i} \min_s \big( E(i_r j_s) - E(i_t j_s) \big) > 0 \qquad (13) \]

This criterion says that $i_r$ is dead-ending if we can always lower the energy by taking
rotamer $i_t$ instead of $i_r$ while keeping the other rotamers fixed. Goldstein’s criterion
can also be extended to pairs of rotamers inconsistent with the GMEC. These criteria
are used iteratively, one after the other, until no more rotamers can be eliminated.
If the method converges to a single solution, this is guaranteed to be the GMEC. In
terms of computational time, calculations using rotamer pairs are significantly more
expensive than those with single rotamers.
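
The single-residue Goldstein criterion (13) can be sketched as an iterative filter. This is a minimal illustration under assumed data structures (`self_e[i][r]` for self-energies, `pair_e[(i, j)][r][s]` with i < j for pair energies); the function name and the toy numbers are hypothetical and this is not the implementation used in the study.

```python
def goldstein_dee(self_e, pair_e):
    """Repeatedly remove rotamers that satisfy criterion (13); return the survivors."""
    n = len(self_e)
    alive = [set(range(len(rotamers))) for rotamers in self_e]

    def pair(i, r, j, s):
        # Pair energies are stored once, under the key with the smaller index first.
        return pair_e[(i, j)][r][s] if i < j else pair_e[(j, i)][s][r]

    changed = True
    while changed:
        changed = False
        for i in range(n):
            for r in sorted(alive[i]):
                for t in sorted(alive[i]):
                    if t == r:
                        continue
                    gap = self_e[i][r] - self_e[i][t]
                    gap += sum(
                        min(pair(i, r, j, s) - pair(i, t, j, s) for s in alive[j])
                        for j in range(n) if j != i
                    )
                    if gap > 0:  # r is dead-ending with respect to t
                        alive[i].discard(r)
                        changed = True
                        break
    return [sorted(a) for a in alive]


# Toy problem with two side chains and two rotamers each (made-up energies).
DEE_SELF = [[0.0, 5.0], [0.0, 1.0]]
DEE_PAIR = {(0, 1): [[0.0, 0.5], [0.2, 0.1]]}
survivors = goldstein_dee(DEE_SELF, DEE_PAIR)
```

When every side chain ends up with a single surviving rotamer, the method has converged and that combination is the GMEC; otherwise the remaining space still has to be searched by other means.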

Other improvements of the DEE-method have been made H 91171 . They have made
it possible to eliminate more rotamers and to accelerate the process of elimination.
Today DEE based methods can handle most side chain positioning problems [QS!- One
still reaches a limit, though, in design problems where this method fails to converge in a
reasonable amount of time mni- This means that the conformational space cannot be
reduced to a single solution. In this study we have only utilized Goldstein’s criterion for
single residues (13).

4 Methods 



The energy function used in this study consists of van der Waals forces between atoms
that are more than three bonds away from each other. Van der Waals parameters were
taken from the CHARMM22 iH parameter set (par_all22_prot). In this study we have
focused on the algorithmic part of the side chain positioning problem, i.e. finding
the global minimum energy conformation of the model. The next step would be to
use a more appropriate energy function. We used a backbone-independent rotamer
library of R. Dunbrack o (bbind99.Aug.lib), which contains a maximum of 81 rotamers
per residue. Further, all hydrogen atoms were ignored. All calculations have been
performed using the backbone coordinates of the lambda repressor (PDB-code: 1r69). The
energy contributions of each rotamer pair to the total energy were calculated in advance
and stored in a vector c, see (9). The two matrices A and b were constructed. An
example of these matrices for a small problem can be seen in Table 1. These matrices
were then used as the input to the linear programming algorithms. First a simplex
algorithm, lp_solve 3.1 El, was used. Secondly the problem was solved using a mixed
integer programming (mip) algorithm from the CPLEX package. This algorithm
is designed to solve problems with both integral and real variables. It makes use of the
fact that we have an integer problem, and finds structures in the input that could be used
to reduce the problem. After a relaxation of the integer constraints it solves the obtained
LP-program by the simplex method. If the solution is not integral a branch and bound
procedure takes place. We have also tried other algorithms from the CPLEX package
such as primal and dual simplex, hybrid network and hybrid barrier solvers. The mip
solver gave the best results.

For comparison we have also performed exhaustive search for up to 10 side chains
and implemented a single-residue based DEE-algorithm, using Goldstein’s criterion
(13). We also implemented a combination of the DEE-theorem and linear programming
(CPLEX-mip). Here we used condition (13) iteratively until no more rotamers
could be eliminated and then applied linear programming on the rest of the solution
space to find the GMEC.

All algorithms were tested on identical systems, and we verified that the correct
solution was found. In Table 2 the size of the conformational space of some of the
problems can be seen. As several different programs were used for the calculations, it is
not trivial to compare the absolute computational time needed. It is, however, possible
to compare their complexity.

Table 2. Some of the problems used in the study; $n_{sc}$ is the number of free side chains,
$n_{rot}$ is the average number of rotamers per side chain. The fraction of rotamers left after
the DEE-search and the total number of conformations in the problems are also shown.
The protein used is the lambda repressor (PDB-code: 1r69).



5 Results and Discussion 

Today there are mainly three types of methods used for the side chain positioning
problem: stochastic methods, mean-field algorithms and DEE-theorem based methods.
Stochastic methods and mean-field algorithms always find a solution; however, it is
not guaranteed to be optimal. If a DEE method converges to one solution, it is
guaranteed to be the globally optimal solution. However, for larger problems DEE methods
do not always converge. In that case, the remaining solutions have to be tested by e.g.
exhaustive search. Here we introduce a novel method using linear programming. If the
solution from the LP-method is integral it corresponds to the GMEC.



Integral Solutions. In our experiments, see below, the solutions were checked for
integrality and so far we have never found a fractional solution. The mip method from
the CPLEX package is made for integer programs. It uses a simplex solver, followed by
branch-and-bound with simplex if the solution is not integral, see Section 4. In our
experiments the branch-and-bound part of the algorithm was never used. We have also
used primal and dual simplex, hybrid network and hybrid barrier solvers from the
CPLEX package. We received integral results from all these solvers. To examine if the
energy function has any effect on the integrality of the solutions we have tried different
nonsense energy functions on a test case with the simplex algorithm. All the solutions
found were integral.
Fig. 2. Experimental studies of the complexity of the two linear programming algorithms
versus the number of side chains. The circles represent the simplex method and
the crosses the mip method. The complexity is approximately 0{nl^) and 0{nl^) re- 
spectively. For comparison a curve representing the exhaustive search is also included. 



Experimental Time Complexity. First the number of free side chains, $n_{sc}$, was increased
to examine the time behavior of the linear programming methods. The computational
time for the mip method and the simplex method (from lp_solve 3.1) are shown
in Figure 2. It can be seen that the time scales approximately as 0(n^^) for the mip
method and as 0(n,g^) for the simplex method. This agrees quite well with the estimation
of the complexity (see Section 2), where the computational time was calculated
to scale as $O(n_{sc}^6)$ for the simplex method. Furthermore, it shows the superiority of
the mip method. The mip method has a better complexity because of the
preprocessor, which finds structures in the input that can reduce the problem size and
increase the speed. This also means that there probably exist better ways to implement
the conditions of (6).

In protein design the average number of rotamers for each side chain, $n_{rot}$, is often
very large. Therefore, it is interesting to study the complexity of the LP-algorithms
versus $n_{rot}$. Our estimation of the time complexity for the simplex method on the side
chain positioning problem is $O(n_{rot}^3)$, see Section 2. This agrees quite well with our
experimental study, see Figure |3 where the complexity is approximately both 

for the simplex and mip algorithms. 



Comparison with the DEE-Method. The DEE-method is a pruning method that
consecutively eliminates parts of the conformation space. First, simple criteria with a low
complexity are used and then slower methods are applied until convergence. For large
design problems DEE-methods do not converge in a reasonable amount of time [QHI.
This means that the time complexity for large systems could be exponential and not
polynomial. All this makes it difficult to calculate an overall time complexity of
DEE-methods. In a study by Pierce et al G3, the time complexity of DEE was estimated
in terms of the cost of the nested loops required to implement each approach. For the
different criteria their estimation was between and 0{n^^n^gf.).

Fig. 3. Experimental studies of the complexity of the two linear programming algorithms
used in this study versus the number of rotamers. The number of side chains is
8. The circles represent the simplex method, the crosses the mip method and the stars
a DEE-algorithm with a final exhaustive search, see methods. For a problem of this
size the DEE-algorithm almost converges. Therefore, the computational time of the
exhaustive search is negligible. The complexity is approximately (mip), 0{n^gf)
(simplex) and 0(n^;,°) (DEE).

To perform a comparison between DEE and LP, a part of the DEE method, Goldstein’s
criterion for single residues, was implemented. However, we did not implement
other criteria. The comparison of the two methods can therefore only be seen as an 
indication of their relative performance. 
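As a rough illustration of what a singles criterion of this kind does, the sketch below applies the usual statement of Goldstein's criterion: rotamer r at residue i is eliminated if some other rotamer t at i satisfies E(i_r) - E(i_t) + sum over j of min over s of [E(i_r, j_s) - E(i_t, j_s)] > 0. The data layout (`self_e`, `pair_e`), the function name and the energies are assumptions made for this example and are not taken from the implementation described in the text.

```python
def goldstein_singles(self_e, pair_e):
    """One pass of Goldstein's singles criterion.

    self_e[i][r]          : self energy of rotamer r at residue i
    pair_e[(i, j)][r][s]  : pair energy of rotamer r at i and s at j, stored for i < j
    Returns the set of (residue, rotamer) pairs that can be eliminated.
    """
    def pair(i, r, j, s):
        # Look up the pair energy regardless of the order of the two residues.
        return pair_e[(i, j)][r][s] if i < j else pair_e[(j, i)][s][r]

    n = len(self_e)
    dead = set()
    for i in range(n):
        for r in range(len(self_e[i])):
            for t in range(len(self_e[i])):
                if t == r:
                    continue
                gap = self_e[i][r] - self_e[i][t]
                for j in range(n):
                    if j != i:
                        gap += min(pair(i, r, j, s) - pair(i, t, j, s)
                                   for s in range(len(self_e[j])))
                if gap > 0:          # t always beats r, so r cannot be in the GMEC
                    dead.add((i, r))
                    break
    return dead

# Made-up energies for two residues with two rotamers each.
self_e = [[0.0, 5.0], [0.0, 1.0]]
pair_e = {(0, 1): [[0.0, 2.0], [1.0, 0.0]]}
dead = goldstein_singles(self_e, pair_e)
```

In this toy case only rotamer 1 of residue 0 is eliminated, so two conformations remain and would still have to be searched, which mirrors the situation where DEE does not converge to a single solution.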

In our implementation, where the DEE did not converge to one solution, the
remaining conformational space was searched exhaustively. When the problem contained
more than approximately 8 residues the DEE algorithm did not converge. In Figure 4
it can be seen that for larger systems our DEE implementation does not show a
polynomial complexity. With a more complete DEE-method it ought to be possible to limit
the remaining conformational space so that a single solution is obtained for much larger
systems. However, the complexity for the DEE algorithm would then be larger.

What might be more interesting is to compare the DEE implementation with the
mip linear programming method. If one only considers the complexity of the
DEE-algorithm before the exhaustive search is started, the complexity is approximately equal
to that of the LP-method, see Figure 4. However, while the LP-method has found the GMEC,
the DEE-algorithm has not converged to a single solution for most problems (Table 3).
In Figure 4 the time complexity of a combination of this DEE-algorithm and the
LP-algorithm mip is also shown. Here, the time for the combined method is less than for
mip alone, but the complexity is a little better for mip.

Fig. 4. Experimental studies of the complexity of the mip linear programming
algorithm and the Dead End Elimination algorithm, versus the number of side chains. The
circles represent the linear programming mip method alone, the crosses a
combination of DEE and LP, the stars the DEE with a final exhaustive search and the '+'-signs
the DEE-algorithm alone. The complexity for mip and the combined methods is
approximately O(ng^) and 0{nl,?) respectively. The complexity of the DEE part of the
computation is the dashed line.

We have also made a study of the DEE-algorithm (Goldstein’s singles criterion)
with an increasing number of rotamers, see Figure 3. Here the problem was small
enough (8 residues) for the DEE-method to eliminate almost all rotamers. The
complexity of the DEE-algorithm was O(n^Qj), i.e. almost identical to that of the LP-methods.
However, the DEE-algorithm was faster.



6 Conclusions 

We have introduced a novel solution method for the problem of optimal side chain
positioning on a fixed backbone. It has earlier been shown that finding the optimal solution
to this problem is essential for the success of automatic protein designs [□. The state of
the art methods include several rounds using different versions of the Dead End
Elimination theorem, with an increasing time complexity. These methods do not necessarily
converge to a single solution, but the conformational space is reduced significantly and
the remaining solutions can be searched.

By using linear programming (LP) we are guaranteed to find an optimal solution
in polynomial time. If this solution is integral it corresponds to the global minimum
energy conformation. Thus far in our studies the solutions have always been integral.
Linear programming is a well studied area of research and many algorithms for
fast solutions are available. We obtained the best results using the mip-method from
the CPLEX package. The time complexity for the mip-method to find the GMEC 
was while our DEE implementation (Goldstein’s singles criterion), had 

a similar complexity of However, for the mip-method the GMEC was 

found while for the DEE-method a fraction of the conformation space remained to 
be searched. More advanced DEE implementations converge to a single solution for 
larger problems, but they use more time consuming criteria, the worst with an esti- 
mated complexity of 0{nl^n^g^) ani- As the complexity for the mip-method has a 
smaller dependency on the number of rotamers, the use of LP-algorithms might be best 
for problems with many rotamers, as in protein design. One reason for the effectiveness 
of the mip-algorithm is most likely the preprocessing of the problem. Therefore,
a reformulation of the input matrices A and b to the simplex algorithm, perhaps as a
network [21], might improve the complexity even further.

A Implementation of the Integer Program

To transform the integer program describing the side chain positioning problem ( 0 ) into 
the following form 

\[ z_{IP} = \min\{ cx \mid Ax = b,\ x \in \{0,1\}^{n} \} \qquad (14) \]

we have formulated the condition (0) as 

^ ^ ,(J-|-l)t “0, i = 1 .. . {Usc 2) , j = 2 . . . (rise 1) ) i ^ j 

s t 

'y ' ~ y ' “0, i = 1 . . . (rise ~ 2) , j = 3 . . . rise, (* + 1) < j 

r t 

= 0, z = l...{rise - 2) ,3 =i+l. 

r t

Abstract. There are very few instances in which positive Darwinian selection
has been convincingly demonstrated at the molecular level. In this study, we 
present a novel test for detecting positive selection at the amino-acid level. In this 
test, amino-acid replacements are characterized in terms of chemical distances, 
i.e., degrees of dissimilarity between the exchanged residues in a protein. The test 
identifies statistically significant deviations of the mean observed chemical dis- 
tance from its expectation, either along a phylogenetic lineage or across a subtree. 
The mean observed distance is calculated as the average chemical distance over 
all possible ancestral sequence reconstructions, weighted by their likelihood. Our 
method substantially improves over previous approaches by taking into account 
the stochastic process, tree phylogeny, among-site rate variation, and alternative
ancestral reconstructions. We provide a linear time algorithm for applying this 
test to all branches and all subtrees of a given phylogenetic tree. We validate 
this approach by applying it to two well-studied datasets, the MHC class I gly- 
coproteins serving as a positive control, and the house-keeping gene carbonic 
anhydrase I serving as a negative control. 



1 Introduction 



The neutral theory of molecular evolution maintains that the great majority of evolu- 
tionary changes at the molecular level are caused not by Darwinian selection acting on 
advantageous mutants, but by random fixation of selectively neutral or nearly neutral 
mutants HI- There are very few cases in which positive Darwinian selection was con- 
vincingly demonstrated at the molecular level 1[1()I22I34B0123II . These cases are vital 
to understanding the link between sequence variability and adaptive evolution. Indeed, 
it has been estimated that positive selection has occurred in only 0.5% of all protein- 
coding genes 

The most widely used method for detecting positive Darwinian selection is based
on comparing synonymous and nonsynonymous substitution rates between nucleotide
sequences EH- Synonymous substitutions are assumed to be selectively neutral. If only
purifying selection operates, then the rate of synonymous substitution should be higher 
than the rate of nonsynonymous substitution. In the few cases where the opposite pat- 
tern was observed, positive selection was invoked as the likely explanation (see, e.g., 
13311411 '!. One critical shortcoming of this method is that it requires estimating
numbers of synonymous substitutions. Because of saturation, such estimation is virtually
impossible when the sequences under study are evolutionarily distant. The estimation
is problematic even if close species are concerned. For example, saturation of
substitutions in the third position is evident even when comparing cytochrome b sequences
among species within the same mammalian order m- 

Another method for detecting positive selection is searching for parallel and con- 
vergent replacements. It is postulated that such molecular changes in different parts of 
a phylogenetic tree can only be explained by the same selective pressure being exerted 
on different taxa that became exposed to the same conditions [|SE1. This method is 
limited to the few cases in which the same type of positive Darwinian selection occurs 
in two or more unrelated lineages. 

A third method of detecting positive selection is based on comparing conservative 
and radical nonsynonymous differences im- Nonsynonymous sites are divided into 
conservative sites and radical sites based on physicochemical properties of the
amino-acid side chain, such as volume, hydrophobicity, charge or polarity. Radical and
conservative sites and radical and conservative replacements are separately counted, and the
number of radical replacements per radical site is compared to the number of conser- 
vative replacements per conservative site. If the former ratio is significantly higher than 
the latter, then positive Darwinian selection is invoked. By using this method, positive 
selection was inferred for the antigen binding cleft of class I
major-histocompatibility-complex (MHC) glycoproteins m and rat olfactory proteins [0. This method for
detecting positive selection has the advantage that distant protein sequences can be compared
even when synonymous substitutions are saturated. Another virtue of this method is its 
flexibility with respect to the sequence characteristic tested. For example, if we suspect 
that polar replacements might be advantageous, a test can be applied with radical re- 
placements defined as those occurring between amino-acids with polar and non-polar 
residues only. However, this method also has many shortcomings. First, no correction 
for multiple substitutions is applicable m. Second, each codon in a pair of aligned 
amino-acid is used twice: Once for estimating the number of radical and conservative 
sites, and once for estimating the number of radical and conservative replacements. 
Third, the method treats replacements between different amino-acids as equally prob- 
able. Fourth, the method ignores branch lengths, implicitly assuming independence of 
the replacement probabilities between the amino acids and the evolutionary distance 
between the sequences under study. Finally, the phylogenetic signal is ignored, i.e., the 
test is applied to pairwise sequence comparisons rather than testing hypotheses on a 
phylogenetic tree. 

The test for positive selection proposed in this study overcomes the shortcomings of 
the radical-conservative test. Our test incorporates a probabilistic framework for deal- 
ing with radical vs. conservative replacements. It applies a novel method for averaging 
over ancestral sequence assignments, weighted by their likelihood, thus eliminating bias
which might result from assuming a specific ancestral sequence reconstruction. The
rationale underlying our proposed test is that the evolutionary acquisition of a new
function requires a significant change of the biochemical properties of the amino-acid
sequence o . To quantify this biochemical difference between two amino-acid sequences,
we define a chemical distance measure based on, e.g., Grantham’s matrix m- Our test 
identifies large deviations of the mean observed chemical distance from the expected 
distance along a branch or across a subtree in a phylogenetic tree. If the observed chem- 
ical distance between two sequences significantly exceeds the chance expectation, then 
it is unlikely that this is the result of random genetic drift, and positive Darwinian se- 
lection should be invoked. 

Based on the assumed stochastic process, the tree topology and its branch lengths, 
we calculate both the mean observed chemical distance and its underlying distribution 
for the branch or subtree in question. The mean observed chemical distance is calculated 
as the average chemical distance over all ancestral sequence reconstructions, weighted 
by their likelihood, thus, eliminating possible bias in a calculation based on a particular 
ancestral sequence reconstruction. The underlying distribution of this random variable 
is calculated using the JTT stochastic model m , the tree topology and branch lengths,
taking into account among site rate variation. We provide a linear time algorithm to 
perform this test for all branches and subtrees of a phylogenetic tree with n leaves. 

In order to validate our approach, we applied it to two control datasets: Class I
major-histocompatibility-complex (MHC) glycoproteins, and carbonic anhydrase I. 
These datasets were chosen since they were already used as standard positive control 
(MHC) and negative control (carbonic anhydrase) for positive selection [l2l- For the 
MHC class I dataset, as reported in B, we observe positive selection which favors 
charge replacements only when applying the test to the subsequences of the binding 
cleft (P < 0.01). In addition we observe positive selection which favors polarity re- 
placements when using Grantham’s polarity indices m (P < 0.01). When applying 
the test to the carbonic anhydrase dataset, no positive selection is observed. 

The paper is organized as follows: Section 2 contains the notations and terminology
used in the paper. Section 3 presents the new test for positive Darwinian selection.
Section 4 describes the application of this test to the two control datasets. Finally, Section 5
contains a summary and a discussion of our approach.



2 Preliminaries 



Let $A$ be the set of 20 amino-acids. We assume that sequence evolution follows the JTT
probabilistic reversible model [El. For amino-acid sequences this model is described
by a $20 \times 20$ matrix $M$, indicating the relative replacement rates of amino-acids, and a
vector $(P_A, \ldots, P_Y)$ of amino-acid frequencies. For each branch of length $t$ and
amino-acids $i$ and $j$, the $i \to j$ replacement probability, denoted by $P_{ij}(t)$, can be
calculated from the eigenvalue decomposition of $M$ [El- (In practice, an approximation to
$P_{ij}(t)$ is used to speed up the computation m.) We denote by
$f_{ij}(t) = P_i P_{ij}(t) = P_j P_{ji}(t)$ the probability of observing $i$ and $j$ in the same
position in two aligned sequences of evolutionary distance $t$.

Let $s$ be an amino-acid sequence. The amino-acid at position $i$ in $s$ is denoted by
$s_i$. For two amino-acids $a, b \in A$, we denote their chemical distance by $d(a, b)$. We
assume we have a table of chemical distances between every pair of amino-acids. One
such distance is Grantham’s chemical distance |@1|. (Other similar distance measures
appear in mm-) This chemical distance measures the difference between two
amino-acids in terms of their volume, polarity and composition of the side chain. The choice
of which distance measure to use reflects the type of test we wish to perform. For
example, Grantham’s distance is appropriate when testing whether the replacements
between the sequences in question are more radical with respect to a range of
physicochemical properties (volume, charge and composition of the side chain). For testing
whether polarity differences between sequences are higher than the random
expectation, two distance measures are applicable. The first measure is based on dividing the
set of amino-acids into 2 categories: polar (C,D,E,H,K,N,Q,R,S,T,W,Y) and non-polar
(the rest). The polarity distance between two amino-acids is then defined as 1 if one is
polar and the other is not, and 0 otherwise [0. The second polarity distance is defined as
the absolute difference between the polarity indices of the two amino-acids, and yields
real values 0. For testing charge differences 3 categories of amino-acids are defined:
positive (H,K,R), negative (D,E) and neutral (all others). The charge distance between
two amino-acids is defined as 1 if they belong to two different categories, and 0 if they
belong to the same category Q .

We define the average chemical distance between two sequences $s^1$ and $s^2$ of length
$N$ as the average of the chemical distances between pairs of amino-acids occupying the
same position in a gapless alignment of $s^1$ and $s^2$:

\[ D(s^1, s^2) = \frac{1}{N} \sum_{i=1}^{N} d(s^1_i, s^2_i) \]

Let $T$ be an unrooted phylogenetic tree. For a node $v$, we denote by $N(v)$ the set
of nodes adjacent to $v$. For an edge $(u, v) \in T$ we denote by $t(u, v)$ the length of the
branch connecting $u$ and $v$.
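The two 0-1 distances and the average chemical distance defined in this section can be sketched in a few lines. The residue groupings are the ones listed in the text; the function names and the two short sequences are made up for illustration.

```python
POLAR = set("CDEHKNQRSTWY")   # polar residues as listed in the text
CHARGE_CLASS = {**{aa: "positive" for aa in "HKR"},
                **{aa: "negative" for aa in "DE"}}

def polarity_distance(a, b):
    """0-1 polarity distance: 1 if exactly one of the two residues is polar."""
    return 1 if (a in POLAR) != (b in POLAR) else 0

def charge_distance(a, b):
    """0-1 charge distance: 1 if the residues fall in different charge classes."""
    ca = CHARGE_CLASS.get(a, "neutral")
    cb = CHARGE_CLASS.get(b, "neutral")
    return 1 if ca != cb else 0

def average_distance(s1, s2, d):
    """Average of d over the aligned positions of two equal-length sequences."""
    if len(s1) != len(s2):
        raise ValueError("sequences must come from a gapless alignment")
    return sum(d(a, b) for a, b in zip(s1, s2)) / len(s1)

# Made-up sequences, for illustration only.
s1, s2 = "KDAL", "EDSL"
charge_avg = average_distance(s1, s2, charge_distance)
polar_avg = average_distance(s1, s2, polarity_distance)
```

Passing the distance as a function keeps the averaging step independent of which chemical property is being tested, so a table-based measure such as Grantham's could be plugged in the same way.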

3 A Test for Positive Darwinian Selection 

In this section we describe a new test for detecting positive Darwinian selection. The 
input to the test is a set of gap-free aligned sequences and a phylogenetic tree for them. 
We first present a version of our test for a pair of known sequences. We then extend 
this method to test positive selection on specific branches of a phylogenetic tree under 
study. Finally we generalize the test to subtrees (clades) and incorporate among site rate 
variation. 

3.1 Testing Two Known Sequences 

Let $s^1$ and $s^2$ be two amino-acid sequences of length $N$ and evolutionary distance $t$.
The underlying distribution of $D(s^1, s^2)$ is inferred as follows. The expectation of the
chemical distance at position $i$ is:

\[ E(d(s^1_i, s^2_i)) = \sum_{a,b \in A} d(a, b) \, f_{ab}(t) \]

Assuming that the distribution of the chemical distance in each position is identical, we
obtain

\[ E(D(s^1, s^2)) = \frac{1}{N} \sum_{i=1}^{N} E(d(s^1_i, s^2_i)) = E(d(s^1_1, s^2_1)) \]

The variance of the chemical distance at position $i$ is:

\[ V(d(s^1_i, s^2_i)) = E(d(s^1_i, s^2_i)^2) - E(d(s^1_i, s^2_i))^2 = \sum_{a,b \in A} d(a, b)^2 f_{ab}(t) - E(d(s^1_i, s^2_i))^2 \]

and assuming further that sequence positions are independent, we obtain

\[ V(D(s^1, s^2)) = \frac{V(d(s^1_1, s^2_1))}{N} \]

For practical values of $N$, $D(s^1, s^2)$ is approximately normally distributed with
expectation $E(D(s^1, s^2))$ and standard deviation $\sqrt{V(D(s^1, s^2))}$. This allows us to
compute for each observed chemical distance $d$ the probability that it occurs by chance,
i.e., its p-value. If the observed chemical distance is found above the 0.99 quantile
of the normal distribution, we conclude that replacements in these two sequences
significantly deviate from the expectation, and suggest positive selection to explain this
phenomenon.
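The pairwise test can be sketched as follows, assuming the joint probabilities f_ab(t) are already available as a nested dictionary. The two-letter alphabet, the probabilities and the sequences are invented for the example; a real application would use the 20 amino-acids and a JTT-based f_ab(t).

```python
import math

def pairwise_test(s1, s2, d, f):
    """z-score and one-sided p-value for the average distance of two sequences.

    d(a, b) : chemical distance
    f[a][b] : joint probability of seeing a and b at one aligned position
              for the evolutionary distance of interest (assumed given)
    """
    alphabet = list(f)
    n = len(s1)
    observed = sum(d(a, b) for a, b in zip(s1, s2)) / n
    mean = sum(d(a, b) * f[a][b] for a in alphabet for b in alphabet)
    second = sum(d(a, b) ** 2 * f[a][b] for a in alphabet for b in alphabet)
    var_site = second - mean ** 2
    z = (observed - mean) / math.sqrt(var_site / n)
    p_value = 0.5 * math.erfc(z / math.sqrt(2))   # upper tail of Normal(0, 1)
    return z, p_value

# Toy two-letter alphabet with made-up joint probabilities (they sum to 1).
f = {"A": {"A": 0.4, "B": 0.1}, "B": {"A": 0.1, "B": 0.4}}
d = lambda a, b: 0 if a == b else 1
z, p = pairwise_test("AAAABBBB", "ABABABAB", d, f)
```

With only eight positions the normal approximation is of course crude; the example is meant to show the arithmetic, not a realistic sample size.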



3.2 Testing a Tree Lineage 

Here we first describe a general method to apply pairwise tests to a phylogenetic tree.
Suppose that we wish to test a statistical hypothesis on a specific branch of the
phylogenetic tree. Also suppose that we have a procedure to test our hypothesis on a pair
of known sequences, like the procedure described above. In order to test our
hypothesis on a specific branch, we could first infer the corresponding ancestral sequences
(using, e.g., maximum likelihood estimation) and then check our hypothesis.
Inferring ancestral sequences and then using these sequences as observations was done in,
e.g., El- This approach, which treats estimated reconstructions as observations, may
lead to erroneous conclusions due to bias in the reconstruction. A more robust approach
is to average over all possible reconstructions, weighted by their likelihood. By
averaging over all possible ancestral assignments, we extend our test to hypothesis testing
on a phylogenetic tree, without possible bias that results from reconstructing particular
sequences at internal tree nodes.

We describe in the following how to apply our test to a specific branch connecting
nodes $x$ and $y$ in a tree $T$. Since we assume that different positions evolve independently,
we restrict the subsequent description to a single site.

Each branch $(u, v) \in T$ partitions the tree into two subtrees. Let $L(u, v, a)$ denote
the likelihood of the subtree which includes $v$, given that $v$ is assigned the amino-acid
$a$. $L(u, v, a)$ can be computed by the following recursion equation:

\[ L(u, v, a) = \prod_{w \in N(v) \setminus \{u\}} \Bigl\{ \sum_{b \in A} P_{ab}(t(v, w)) \cdot L(v, w, b) \Bigr\} \]

For a leaf $v$ at the base of the recursion we have $L(u, v, a) = 1$, assuming amino-acid
$a$ in $v$, and $L(u, v, a) = 0$ otherwise.

The likelihood of $T$ is thus:

\[ \Pr = \sum_{a,b \in A} f_{ab}(t(u, v)) \cdot L(u, v, b) \cdot L(v, u, a) \]

where $(u, v)$ is any branch of $T$.

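Suppose that the data at the leaves of $T$ is $w = (w_1, \ldots, w_n)$. The mean observed
chemical distance for a given branch $(x, y) \in T$ can be calculated as follows:

\[ D(x, y) = \sum_{a,b \in A} \Pr(x = a, y = b \mid w) \cdot d(a, b)
           = \frac{1}{\Pr} \sum_{a,b \in A} \{ d(a, b) \cdot f_{ab}(t(x, y)) \cdot L(x, y, b) \cdot L(y, x, a) \} \]

where $\Pr$ denotes the likelihood of $T$.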
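The recursion for L(u, v, a), the tree likelihood and the mean observed distance D(x, y) can be sketched for a single site as below. To keep the example small it uses a toy symmetric two-state model in place of JTT, and a six-node tree with made-up branch lengths and leaf data; all names (`TREE`, `LEAVES`, `p_replace`, and so on) are assumptions of this sketch.

```python
import math
from functools import lru_cache

STATES = "AB"
FREQ = {"A": 0.5, "B": 0.5}

def p_replace(a, b, t):
    """Toy symmetric two-state replacement probability (stand-in for JTT)."""
    same = 0.5 + 0.5 * math.exp(-2.0 * t)
    return same if a == b else 1.0 - same

def joint(a, b, t):
    """f_ab(t) = P_a * P_ab(t)."""
    return FREQ[a] * p_replace(a, b, t)

# Unrooted tree as an adjacency map: node -> {neighbour: branch length}.
TREE = {
    "x": {"l1": 0.1, "l2": 0.2, "y": 0.3},
    "y": {"l3": 0.1, "l4": 0.4, "x": 0.3},
    "l1": {"x": 0.1}, "l2": {"x": 0.2},
    "l3": {"y": 0.1}, "l4": {"y": 0.4},
}
LEAVES = {"l1": "A", "l2": "A", "l3": "B", "l4": "B"}   # observed data at one site

@lru_cache(maxsize=None)
def L(u, v, a):
    """Likelihood of the subtree on v's side of branch (u, v), given v = a."""
    if v in LEAVES:
        return 1.0 if LEAVES[v] == a else 0.0
    result = 1.0
    for w, t in TREE[v].items():
        if w != u:
            result *= sum(p_replace(a, b, t) * L(v, w, b) for b in STATES)
    return result

def tree_likelihood(u, v):
    """Likelihood of the whole tree, evaluated across branch (u, v)."""
    t = TREE[u][v]
    return sum(joint(a, b, t) * L(u, v, b) * L(v, u, a)
               for a in STATES for b in STATES)

def mean_observed_distance(x, y, d):
    """Posterior-weighted average of d(a, b) over assignments to x and y."""
    t = TREE[x][y]
    total = sum(d(a, b) * joint(a, b, t) * L(x, y, b) * L(y, x, a)
                for a in STATES for b in STATES)
    return total / tree_likelihood(x, y)

d01 = lambda a, b: 0.0 if a == b else 1.0
lik_xy = tree_likelihood("x", "y")
lik_leaf = tree_likelihood("x", "l1")
dist_xy = mean_observed_distance("x", "y", d01)
```

Because the model is reversible, the likelihood comes out the same whichever branch it is evaluated across, which the two calls to `tree_likelihood` illustrate. Memoising `L` means each directed branch and state is computed once, which is the idea behind a linear-time traversal.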

It remains to compute the null distribution of this statistic. The expectation of
$D(x, y)$ (with respect to all possible leaf-assignments) is as follows:

\[
\begin{aligned}
E(D(x, y)) &= \sum_{z \in A^n} \Pr(z) \sum_{a,b \in A} \Pr(x = a, y = b \mid z) \cdot d(a, b) \\
           &= \sum_{a,b \in A} d(a, b) \sum_{z \in A^n} \Pr(z) \cdot \Pr(x = a, y = b \mid z) \\
           &= \sum_{a,b \in A} d(a, b) \cdot f_{ab}(t(x, y))
\end{aligned}
\]

We conclude that $E(D(x, y))$ is the same as in the known-sequences case. For the
variance of $D(x, y)$ we have no explicit formula. Instead, we evaluate $V(D(x, y))$ using
parametric bootstrap ES- Specifically, we draw at random many assignments of
amino-acids to the leaves of $T$ and compute $D(x, y)$ for each of them, thereby evaluating its
variance. An assignment to the leaves of $T$ is obtained as follows: We first root $T$
at an arbitrary node $r$. We then draw at random an amino-acid for $r$ according to the
amino-acid frequencies. We next draw amino-acids for each child of $r$ according to the
appropriate replacement probabilities of our model, and continue in this manner until we
reach the leaves.
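Drawing one leaf assignment in this way can be sketched as follows. As before, a toy two-state model stands in for JTT, and the tree, branch lengths and function names are assumptions of the example rather than details taken from the text.

```python
import math
import random

STATES = "AB"
FREQ = {"A": 0.5, "B": 0.5}

def p_replace(a, b, t):
    """Toy symmetric two-state replacement probability (stand-in for JTT)."""
    same = 0.5 + 0.5 * math.exp(-2.0 * t)
    return same if a == b else 1.0 - same

# Unrooted tree as an adjacency map: node -> {neighbour: branch length}.
TREE = {
    "x": {"l1": 0.1, "l2": 0.2, "y": 0.3},
    "y": {"l3": 0.1, "l4": 0.4, "x": 0.3},
    "l1": {"x": 0.1}, "l2": {"x": 0.2},
    "l3": {"y": 0.1}, "l4": {"y": 0.4},
}

def draw_leaf_assignment(tree, root, rng):
    """Simulate one site down the tree and return the states at the leaves."""
    # State at the root is drawn from the stationary frequencies.
    states = {root: rng.choices(STATES, weights=[FREQ[s] for s in STATES])[0]}
    stack = [(root, None)]
    while stack:
        node, parent = stack.pop()
        for child, t in tree[node].items():
            if child == parent:
                continue
            # Child state is drawn conditional on the state of its parent.
            weights = [p_replace(states[node], b, t) for b in STATES]
            states[child] = rng.choices(STATES, weights=weights)[0]
            stack.append((child, node))
    # Leaves are the nodes with exactly one neighbour.
    return {v: s for v, s in states.items() if len(tree[v]) == 1}

rng = random.Random(0)
samples = [draw_leaf_assignment(TREE, "x", rng) for _ in range(100)]
```

Computing D(x, y) for each simulated assignment and taking the sample variance of those values would then give the bootstrap estimate of V(D(x, y)).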

Finally, since $D(x, y)$ is approximately normally distributed, we can compute a
p-value for the test, which is simply

\[ \Pr\Bigl( Z \geq \frac{D(x, y) - E(D(x, y))}{\sqrt{V(D(x, y))}} \Bigr), \]

where $Z \sim \mathrm{Normal}(0, 1)$. Note that if the test is applied to several (or all) branches of the tree,
then the significance level of the test should be corrected in accordance with the number
of tests performed, e.g., using Bonferroni’s correction, which multiplies the p-value by
the number of branches tested.
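The p-value and its Bonferroni adjustment amount to a few lines of arithmetic, sketched here with made-up numbers. The cap at 1 in `bonferroni` is a common convention added in this sketch; the text itself only mentions multiplying by the number of branches.

```python
import math

def upper_tail_p(observed, expected, variance):
    """Pr(Z >= z) for Z ~ Normal(0, 1), with z the standardised statistic."""
    z = (observed - expected) / math.sqrt(variance)
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def bonferroni(p_value, n_tests):
    """Multiply the p-value by the number of tests, capped at 1."""
    return min(1.0, p_value * n_tests)

# Made-up values giving a z-score of about 3, tested on 10 branches.
p_raw = upper_tail_p(observed=0.50, expected=0.20, variance=0.01)
p_adj = bonferroni(p_raw, n_tests=10)
```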

The algorithm for testing the branches of a phylogenetic tree $T$ is summarized in
Figure 1. For each branch $(x, y) \in T$ the algorithm outputs the p-value of the test for
that branch. In the actual implementation we used $M = 100$.



PositiveSelectionTest(T):

    Root T at an arbitrary node r.
    Draw M assignments to the leaves of T using parametric bootstrap.
    Traverse T bottom-up, computing along the way for every $(u, v) \in T$, $a \in A$
        the value of L(u, v, a), where u is the parent of v.
    Traverse T top-down, computing along the way for every $(u, v) \in T$, $a \in A$
        the value of L(v, u, a), where u is the parent of v.
    For every $(x, y) \in T$ do:
        Calculate D(x, y) and E(D(x, y)).
        Evaluate V(D(x, y)).
        Output the p-value for the branch (x, y).

Fig. 1. An Algorithm for Testing the Branches of a Phylogenetic Tree T.

Theorem 1. For a given phylogenetic tree $T$ with $n$ leaves, the algorithm tests all
branches of $T$ in $O(n)$ time.

Proof. Given $L(u, v, a)$ for every $(u, v) \in T$ and every $a \in A$, it is straightforward to
compute $D(u, v)$ for all $(u, v) \in T$ in linear time. The computation of $E(D(u, v))$ and
$V(D(u, v))$ is clearly linear. The complexity follows.

3.3 Testing a Subtree 

In this section we present an extension of our method to test subtrees of a given phylo- 
genetic tree T. This is motivated by the consideration that if a clade of contemporary 
sequences has undergone positive Darwinian selection, we cannot necessarily assume 
that this selection occurred solely along the branch leading to that clade. A reasonable 
scenario is that the selection was continuous and occurred along several or all branches 
of the subtree corresponding to this clade. In such a case, the test we have just described 
may not detect any significant positive selection along any specific branch. Hence, we 
are interested in testing for positive selection across subtrees as well.

For a subtree $T'$ of $T$, we define the mean observed chemical distance $D(T')$ as
the average observed distance along its branches (i.e., the sum of the observed distances
for each branch divided by the number of branches in $T'$). Clearly, the expectation of
$D(T')$ is equal to the average expectation of the branches of $T'$. The variance of $D(T')$
can be evaluated using parametric bootstrap. We then use the normal approximation to
compute a p-value for this test. We conclude:

Theorem 2. For a given phylogenetic tree $T$ with $n$ leaves, the complexity of testing
all its subtrees is $O(n)$.

3.4 Introducing among Site Rate Variation

The rate of evolution is not constant among amino-acid sites m. Consider two
sequences of length $N$. Suppose that there are on average $l$ replacements per site between
these sequences. This means that we expect $lN$ replacements altogether. How many
replacements should we expect at each particular site? Naive models assume that the
variation of mutation rate among sites is zero, i.e., that all sites have the same
replacement probability. Models that take this Among Site Rate Variation (ASRV) into account
assume that at the $j$-th position the average number of replacements is $l \, r[j]$, where each
$r = r[j]$ is a rate parameter drawn from some probability distribution. Since the mean
rate over all sites is $l$, the mean of $r$ is equal to 1. Yang suggested the gamma
distribution with parameters $\alpha$ and $\beta$ as the distribution for $r$, and since the mean of the gamma
distribution, $\alpha / \beta$, must be equal to 1, $\alpha = \beta$ [28], that is:

\[ g(r; \alpha, \alpha) = \frac{\alpha^{\alpha}}{\Gamma(\alpha)} \, r^{\alpha - 1} e^{-\alpha r} \]

Maximum likelihood models incorporating ASRV are statistically superior to those
assuming among site rate homogeneity an. They also help avoid the severe
underestimation of long branch lengths that can occur with the homogeneous models [HI.

In this study we use the discrete gamma model with $k$ categories whose means
are $r_1, \ldots, r_k$ to approximate the continuous gamma distribution an. The categories
are selected so that the probabilities of $r$ falling into each category are equal. We thus
assume that $\Pr(r = r_i) = 1/k$.
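To give a feel for what the category means look like, the sketch below approximates them by simulation: it draws from a gamma distribution with mean 1, sorts the draws, splits them into k equally sized groups and averages each group. This Monte Carlo shortcut is a substitute chosen for brevity, not the exact quantile-based computation of the discrete gamma model; the function name, sample size and seed are arbitrary.

```python
import random

def discrete_gamma_means(alpha, k, n_samples=200_000, seed=0):
    """Approximate the k equal-probability category means of Gamma(alpha, alpha)."""
    rng = random.Random(seed)
    # random.gammavariate takes shape and *scale*; scale = 1/alpha gives mean 1.
    draws = sorted(rng.gammavariate(alpha, 1.0 / alpha) for _ in range(n_samples))
    size = n_samples // k
    return [sum(draws[i * size:(i + 1) * size]) / size for i in range(k)]

# Four categories with a shape parameter of 0.5 (an illustrative value).
rates = discrete_gamma_means(alpha=0.5, k=4)
```

For small shape parameters the lowest category mean is close to zero and the highest is well above 1, which reflects strong rate heterogeneity among sites.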

The incorporation of the discrete gamma model in our test is straightforward. For
each rate category $i$ we calculate both the expected and observed chemical distance,
given that the rate is $r_i$. This is equivalent to making the computation in the
homogeneous case, where all branch lengths are multiplied by the factor $r_i$. The observed and
expected chemical distance for each branch are then averaged over all rate categories.



4 Biological Results 

In order to validate our approach, we applied it to two control datasets: Class I
major-histocompatibility-complex (MHC) glycoproteins, and carbonic anhydrase I. We have
chosen to analyze these datasets since they were already used as standard positive con- 
trol (MHC) and negative control (carbonic anhydrase) for positive selection tests 1 1241 . 

The datasets contain aligned sequences (all sequences are of the same length, and 
the best alignment is gapless). Phylogenetic trees were constructed using the MOLPHY 
software IDl, with the neighbor-joining method 1^ for MHC class I, and with the 
maximum likelihood method for carbonic anhydrase I. The reason for the use of two 
tree construction methods is that in the MHC case we are dealing with 42 sequences and, 
therefore, an exhaustive maximum likelihood approach is impractical. Branch lengths 
for each topology were estimated using the maximum likelihood method [H with the 
JTT stochastic model o. assuming that the rate is discrete gamma distributed among
sites with 4 rate categories.

Fig. 2. A Phylogenetic Tree for MHC Class I Sequences. Species labels are as in m-
The tree topology was estimated by using whole sequences. Branch lengths were
estimated for the cleft subsequences only. Each branch was subjected to the positive
selection test on the cleft subsequences. Branches in bold-face indicate p-value < 0.01.

4.1 MHC Class I

The primary immunological function of MHC class I glycoproteins is to bind and 
“present” antigenic peptides on the surface of cells, for recognition by antigen-specific 
T cell receptors. MHC class I glycoproteins are expressed on the surface of most cells
and are recognized by CD8-positive cytotoxic T cells, an essential step for initiating 
the elimination of virally infected cells by T-cell mediated lysis. These molecules are 
very polymorphic, and it was claimed that this polymorphism is the result of positive 
Darwinian selection that operates on the antigen-binding cleft iB- Using pairwise com- 
parisons of sequences, it was shown that the proportion of nonsynonymous differences 
in the antigen-binding cleft that cause charge changes was significantly higher than the 
proportion that conserve charge. This suggests that peptide binding is at the basis of the 
positive selection acting on these loci Q. 

Following [9] we analyzed 42 human MHC class I sequences from three allelic
groups: HLA-A, -B, and -C loci. Most of these sequences are not available in GenBank,
and were copied from Parham et al. [18]. The length of each MHC class I sequence is
274 amino acids. The binding site is a subsequence of 29 residues [18]. The
phylogenetic tree for MHC class I sequences is given in Figure 2. The $\alpha$ parameter found for this
tree was 0.24. When our clade-based test was applied to the whole tree, no indication
of positive selection was found. The respective z-scores are shown in Table 1.



Table 1. A list of z-scores for each of the tests performed on the MHC class I dataset. 
The first row contains scores with respect to whole sequences. The second row contains 
results with respect to the binding cleft subsequences, with branch lengths as for the 
whole sequences. The third row contains results with respect to the binding cleft subse- 
quences, with branch lengths reestimated on this part of the sequence only. Significant 
z-scores (p-value< 0.01) appear in bold-face. 



Dataset/Distance               Grantham   Charge   Polarity     Polarity
                                                   (Grantham)   (Hughes et al. 0)

Whole                            -1.30     0.01      -1.25         1.10
Cleft                             9.38     9.32      13.23         5.79
Cleft & cleft-based lengths       1.08     3.14       2.78         0.01

When we applied our test to the binding site only, positive selection was found with
very high confidence (P < 0.001). The respective z-scores are shown in Table 1.
However, it might be argued that when only the binding site part of the sequence is analyzed,
the branch lengths estimated for the whole sequences are irrelevant. Since it is known
that the rate of evolution in the binding site is faster relative to the rest of the sequence,
the branch lengths estimated from the whole sequences are underestimated. This
underestimation can result in a false positive conclusion of positive selection, since we expect
in this case an excess of radical replacements. To overcome this problem, branch lengths
were reestimated on the binding site part of the sequence only. Significant excesses of
polar and charge replacements were found also with these new estimates (P < 0.01). The
corresponding z-scores are shown in Table 1. We note that using the 0-1 polarity
distance of t§|, we found no evidence for positive selection. On the other hand, when we
used Grantham’s polarity indices |0|, significant deviations from the random
expectations were observed (see Table 1). The latter distance measure is clearly more accurate
since it is not restricted to 0-1 values. We conclude that there is a significant excess in
both charge and polar replacements, and not only in charge replacements, as reported
in O.

Finally, we tested specific branches in the tree to find those branches which contribute
the most to the excess of charge replacements. Branches whose corresponding p-value was
found to be smaller than 0.01 appear in bold-face in the tree figure. We note that since
we have no prior knowledge of which branches are expected to show excess of charge
replacements, these p-values should be scaled according to the number of branches tested.
Nevertheless, these high scoring branches all lie in the subtrees corresponding to the
A and B alleles, matching the findings of Hughes et al., who report positive selection
for these alleles only.
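The scaling of p-values by the number of branches tested can be illustrated with a short sketch. It rests on two assumptions that are not stated in the text: that a z-score is compared against a standard normal distribution in a one-sided test, and that the scaling is a Bonferroni correction. The function names and the number of branches in the example are hypothetical.

```python
import math


def one_sided_p(z: float) -> float:
    """Upper-tail probability P(Z >= z) for a standard normal Z (assumed reference)."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def bonferroni(p: float, n_tests: int) -> float:
    """Scale a p-value by the number of tests performed, capped at 1."""
    return min(1.0, p * n_tests)


# Hypothetical example: a branch with raw p-value 0.004 among 10 tested branches.
scaled = bonferroni(0.004, 10)
significant_after_scaling = scaled < 0.01
```

Under these assumptions, the cleft-based z-score of 2.78 in Table 1 corresponds to a one-sided p-value below 0.01, while 1.08 does not; a raw branch p-value of 0.004 would no longer pass the 0.01 threshold once scaled by ten tests.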

4.2 Carbonic Anhydrase I 

This dataset comprises 6 sequences of the carbonic anhydrase I house-keeping gene,
for which there is no evidence of positive selection. The carbonic anhydrase I sequences
were the same as in a previous study, except that amino-acid sequences were used instead
of nucleotide sequences. Sequence accession numbers are: JN0835 (Pan troglodytes), 
JN0836 (Gorilla gorilla), P00915 (Homo sapiens), P35217 (Macaca nemestrina), 
P48282 (Ovis aries) and P13634 (Mus musculus). The maximum likelihood estimate 
of the α parameter for this dataset was 0.52.

When analyzing carbonic anhydrase I sequences, no evidence for positive selection
was found. This was true irrespective of the distance measure used: Grantham
(z-score = 0.01), Grantham's polarity (z-score = -1.04), Hughes et al. polarity
(z-score = -0.49), and charge (z-score = -1.73).

5 Discussion 

Natural selection may act to favor amino acid replacements that change certain
properties of amino acids. Here we propose a method to test for such selection. Our
method takes into account the stochastic model of amino-acid replacements, among 
site rate variation and the phylogenetic relationship among the sequences under study. 
The method is based on identifying large deviations of the mean observed chemical 
distance between two proteins from the expected distance. Our test can be applied to 
a specific branch of a phylogenetic tree, to a clade in the tree or, alternatively, over 
all branches of the phylogenetic tree. The calculation of the mean observed chemical 
distance is based on a novel procedure for averaging the chemical distance over all pos- 
sible ancestral sequence reconstructions weighted by their likelihood. This results in an 
unbiased estimate of the chemical distance along a branch of a phylogenetic tree. The 
underlying distribution of this random variable is calculated using the JTT model, tak- 
ing into account among site rate variation. We give a linear time algorithm to perform 
this test for all branches and subtrees of a given phylogenetic tree. 

Two variants of the test are presented: The first is a statistical test of a single branch
in a phylogenetic tree. Positive selection along a tree lineage can be the result of a spe- 
cific adaptation of one taxon to some special environment. In this case, the branch in 
question is known a priori, and the branch-specific test should be used. Alternatively, if 
the selection constraints are continuous, as for example, the selection that promotes di- 
versity among alleles of the MHC class I, the test should be applied to all the sequences 
under the assumed selection pressure - a clade-based test. 

We validated our method on two datasets: Carbonic anhydrase I sequences served 
as a negative control, and the cleft of MHC class I sequences as a positive control. MHC 
class I sequences were previously shown to be under positive selection pressure, acting 
to favor amino-acid replacements that are radical with respect to charge. 

There are, however, some limitations to our method. The method relies heavily on 
an assumed stochastic model of evolution. If this model underestimates branch lengths, 
one might get false positive results. It is for this reason that it is important to estimate 
branch lengths under realistic models, taking into account among site rate variation. 
Furthermore, if the test is applied to specific parts of the protein, such as an alpha helix,
a replacement matrix that is specific for this part might be preferable over the more
general JTT model used in this study. One might claim that if an excess of, say, polar
replacements is found, it should not be interpreted as indicative of positive selection, but
rather as an indication that a more sequence-specific amino-acid replacement model is
required. In MHC class I glycoproteins, however, other lines of evidence suggest
positive Darwinian selection.

In the future, we plan to make the test more robust by accommodating uncertainties 
in branch lengths and topology. This can be achieved by Markov-Chain Monte-Carlo
methods. The sensitivity of our test to different assumptions regarding the stochastic
process and the phylogenetic tree will be better understood when more datasets are 
analyzed. 

Abstract. In this paper, we consider the problem of computing a maximum compatible
tree for k rooted trees when the maximum degree of all trees is bounded. We show that
the problem can be solved in O(2^{2kd} n^k) time, where n is the number of taxa in each
tree, and every node in every tree has at most d children. Hence, a maximum compatible
tree for k unrooted trees can be found in O(2^{2kd} n^{k+1}) time.

1 Introduction 

A “phylogeny” (or evolutionary tree) represents the evolutionary history of a set of
species S by a rooted (typically binary) tree in which the leaves are labelled with
elements from S and the internal nodes are unlabelled.

For a variety of reasons, a typical outcome of a phylogenetic analysis of a dataset 
can consist of many different unrooted trees, and each tree represents an equally believ- 
able estimate of the true tree. Making sense of the set of these trees is then a challenging 
prospect. 

There are two basic approaches that have been used for this problem. The first ap- 
proach (and the most popular) represents the set of trees by a single tree on the full 
dataset (i.e. the “consensus tree”). Consensus tree techniques such as “strict consen- 
sus” and “majority consensus” are the most popular, and have the advantage that they 
are polynomial time. However, consensus tree methods have the disadvantage that they 
tend to create consensus trees that lack resolution (and this lack of resolution can be
significant), meaning that the trees can contain high degree nodes. An alternate approach
seeks a subset of the taxa on which all the trees agree (the “maximum agreement subset”
(MAST) approach), which means that when restricted to this subset of the taxa, the
trees are identical. By contrast with the consensus tree approach, the agreement subset
approach is computationally intensive (unless at least one of the trees has low degree);
furthermore, the size of the maximum agreement subset can be very small.

In this paper we consider a different approach to this problem, in which we seek
the largest subset of taxa on which all the trees are “compatible” (meaning that when
restricted to this subset of taxa, the trees share a common refinement). The problem is
suited for the case where the set of trees contains unresolved trees (such as can happen
when returning a set of consensus trees, one for each phylogenetic island obtained during
a search for the maximum parsimony or maximum likelihood tree). In such cases,
it may return a larger subset of taxa than the maximum agreement subset approach, as
illustrated in Figure 1. This “maximum compatible set” (MCS) problem is NP-hard for
6 or more trees. We will denote by MCT the tree constructed on the MCS, and call
it the maximum compatible tree. Occasionally, when the context is clear, we will use
MAST to also denote the maximum agreement subtree.

Fig. 1. The MCT of T1 and T2 has more leaves than the MAST.

Our result for the MCS problem is an algorithm for the two-tree MCS problem
which runs in O(2^{4d} n^3) time, where n = |S| and d is the maximum degree of
the two unrooted trees. (The algorithm we present is a dynamic programming algorithm
and is an extension of the earlier dynamic programming algorithm for the two-tree
MAST problem.) Thus, we show that the two-tree MCS problem is fixed-parameter
tractable. We extend this algorithm to the k-tree MCS problem in which all
trees have bounded degree, obtaining an O(2^{2kd} n^k) algorithm.

The organization of the paper is as follows. In Section 2 we give some basic
definitions (we will introduce and explain other terminology as needed). We then present
the dynamic programming algorithm for two-tree MAST, and discuss the challenges in
extending the algorithm to the two-tree MCS problem. In Section 3 we present the
dynamic programming algorithm for the two-tree MCS problem, the proof of correctness,
and running time analysis. We then show how we extend this algorithm to the k-tree
case. In Section 4 we discuss further work in the area.



2 Computing a Maximum Agreement Subtree of Two Trees 

In this section, we first present some definitions and then we describe the earlier
dynamic programming algorithm for computing the maximum agreement subtree of two
trees. The algorithm actually computes the cardinality of the maximum agreement
subset but can be easily extended to give the maximum agreement subset (or
equivalently the maximum agreement subtree).

Definition 1. Given a leaf-labelled tree T on a set S, the restriction of T to a set X ⊆ S
is obtained by removing all leaves in S - X and then suppressing all internal nodes
of degree two. This restriction will be denoted by T|X.
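As an illustration of Definition 1, the following sketch computes a restriction for rooted trees. The representation is an assumption made here for brevity: a tree is a nested tuple whose leaves are strings, and in this rooted setting "suppressing a node of degree two" means replacing an internal node that is left with a single child by that child. The function name restrict is hypothetical.

```python
def restrict(tree, keep):
    """Return tree|keep for a rooted tree given as nested tuples of leaf labels.

    Returns None when no leaf of `tree` belongs to `keep`.
    """
    if isinstance(tree, str):                 # a leaf
        return tree if tree in keep else None
    kids = [r for r in (restrict(child, keep) for child in tree) if r is not None]
    if not kids:
        return None                           # the whole subtree was removed
    if len(kids) == 1:
        return kids[0]                        # suppress a node left with one child
    return tuple(kids)


# Restricting ((a,b),(c,(d,e))) to {a, d, e} removes b and c and suppresses
# the two internal nodes that are left with a single child.
restricted = restrict((("a", "b"), ("c", ("d", "e"))), {"a", "d", "e"})
```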



Definition 2. Given a set Σ = {T_1, T_2, ..., T_k} of leaf-labelled trees on the set S,
a maximum agreement subset for the set of trees Σ is a set X of maximum cardinality
such that the trees T_i|X are all isomorphic. (The isomorphism must map leaves with the
same label to each other.)



Definition 3. Given a set Σ = {T_1, T_2, ..., T_k} of leaf-labelled trees on the set S, a
maximum compatible subset for the set of trees Σ is a set X of maximum cardinality
such that the trees T_i|X all share a common refinement.

We will first assume that the two trees T and T' are both rooted (we will later discuss 
how to extend the algorithm for unrooted trees). The trees can have any degree, but both 
are leaf-labelled by the same set S of taxa. 

Let v be a node in T, and denote by T_v the subtree of T rooted at v. Similarly
denote by T'_w the subtree of T' rooted at a node w in T'. Denote by MAST(T_v, T'_w)
the number of leaves in a maximum agreement subset of T_v and T'_w. The dynamic
programming algorithm operates by computing MAST(T_v, T'_w) for all pairs of nodes
(v, w) in V(T) x V(T'), “bottom-up”.

We now describe the basic idea of the dynamic programming algorithm. First, the
value MAST(T_v, T'_w) is easy to compute when either v or w is a leaf. Now consider
the computation of MAST(T_v, T'_w) where neither v nor w is a leaf, and let X be a
maximum agreement subset of T_v and T'_w. The least common ancestor of X in T_v may
be v, or it may be a node below v. Similarly, the least common ancestor of X in T'_w
may be w or it may be a node below w. We have four cases to consider:

1. If the least common ancestor of X is below v in T and similarly below w in T',
then |X| = MAST(v_i, w_j) for some pair (v_i, w_j) of nodes where v_i is a child of v
and w_j is a child of w. (Here MAST(v_i, w_j) is shorthand for the value computed
for the subtrees rooted at v_i and w_j.)

2. If the least common ancestor of X is below v in T but equal to w in T', then
|X| = MAST(v_i, w) for some child v_i of v.

3. If the least common ancestor of X is below w in T' but equal to v in T, then
|X| = MAST(v, w_j) for some child w_j of w.

4. If the least common ancestor of X is v in T and w in T', then X is the disjoint
union of X_1, X_2, ..., X_p where each X_i is a maximum agreement subset for some
pair of subtrees T_{v_a} and T'_{w_b} (in which v_a is a child of v and w_b is a child
of w). Furthermore, for each i ≠ j, the pairs of subtrees associated to X_i and X_j are
disjoint. Hence |X| is the maximum value of a matching in the weighted complete
bipartite graph G(A, B, E) in which A is the set of children of v, B is the set of
children of w, and the weight of edge (v_i, w_j) is MAST(v_i, w_j).
This discussion suggests a straightforward dynamic programming algorithm which
involves computing O(n^2) subproblems, each of which involves computing the maximum
of O(d) values (where d is the maximum degree of any node in T or T').
Each of these values in turn is easy to compute, though the maximum weighted
bipartite matching of an O(d)-vertex graph takes O(d^{2.5} log d) time. The running
time analysis of the MAST algorithm shows that it is polynomial in n even if d is not
bounded, and O(n^2) if d is bounded.
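The four cases can be sketched in code as follows. This is an illustrative sketch, not the implementation analysed above: trees are assumed to be nested Python tuples with string leaf labels, the names leaves, best_matching and mast are hypothetical, and the matching of the fourth case is found by brute force over permutations (reasonable only for small degree) instead of an efficient weighted bipartite matching algorithm.

```python
from functools import lru_cache
from itertools import permutations


def leaves(t):
    """Set of leaf labels of a tree given as nested tuples of strings."""
    if isinstance(t, str):
        return {t}
    return set().union(*(leaves(child) for child in t))


def best_matching(weights):
    """Maximum weight of a matching in a small complete bipartite graph."""
    rows, cols = len(weights), len(weights[0])
    if rows > cols:                       # transpose so that rows <= cols
        weights = [list(col) for col in zip(*weights)]
        rows, cols = cols, rows
    return max(sum(weights[r][perm[r]] for r in range(rows))
               for perm in permutations(range(cols), rows))


@lru_cache(maxsize=None)
def mast(v, w):
    """Size of a maximum agreement subset of the rooted subtrees v and w."""
    if isinstance(v, str):
        return 1 if v in leaves(w) else 0
    if isinstance(w, str):
        return 1 if w in leaves(v) else 0
    # Cases 2 and 3; case 1 is reached by applying them one after the other.
    below = max([mast(vi, w) for vi in v] + [mast(v, wj) for wj in w])
    # Case 4: match children of v with children of w.
    matched = best_matching([[mast(vi, wj) for wj in w] for vi in v])
    return max(below, matched)


t1 = (("a", "b"), ("c", "d"))
t2 = (("a", "c"), ("b", "d"))
size = mast(t1, t2)   # the two trees share no rooted triple
```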



3 Algorithms for the MCS Problem 

3.1 Relation between MCS and MAST 

We begin by observing the following: 

Lemma 1. Let T_1 and T_2 be two unrooted leaf-labelled trees. Let X be an MCS of T_1
and T_2. Then there exists a binary tree T'_1 refining T_1 and a binary tree T'_2 refining
T_2 such that X is a MAST of T'_1 and T'_2.

Proof. Since X is a compatible subset of taxa for the pair of trees T_1 and T_2, there is a
common refinement T* of T_1|X and T_2|X. Hence we can refine T_1, obtaining T'_1, and
refine T_2, obtaining T'_2, so that T'_1 restricted to X yields T*, and similarly T'_2
restricted to X also yields T*. Then X is an agreement subset of T'_1 and T'_2.

This observation is illustrated in Figures 2 and 3. It suggests an obvious
algorithm for computing an MCS of two trees: for each way of refining the two trees into
binary trees, compute a MAST. However, this algorithm is computationally expensive,
since the number of binary refinements of an n-leaf tree with maximum degree d can
be exponential in n. Hence this brute force algorithm will not be acceptably fast.

Fig. 2. The MCT of Trees T1 and T2.
Fig. 3. The MAST of Trees T3 and T4.



3.2 Computing an MCS of Two Rooted Trees 

We now describe the dynamic programming algorithm for the maximum compatible set 
(MCS) of two rooted trees, both with degree bounded by d (by this we mean that every 
node has at most d children). As for the MAST problem, here too the algorithm com- 
putes the cardinality of an MCS, but can be easily extended to compute an MCS itself. 
This algorithm can easily be extended to produce a dynamic programming algorithm 
for computing an MCS of two unrooted trees, by computing an MCS of each of the n 
pairs of rooted trees (obtained by rooting the unrooted trees at each of the n leaves). 
Furthermore, we also show how to extend the algorithm to handle k rooted or unrooted 
trees. 

The basic set of problems we need to compute must include the computation of an
MCS of subtrees T_v and T'_w for every pair of nodes v and w. (T_v denotes the subtree of
T rooted at v, and similarly T'_w denotes the subtree of T' rooted at w.) We will also need
to include the computation of MCS's of other pairs of trees, but begin our discussion
with these MCS calculations.

Let T and T' be two rooted trees, and let v and w denote nodes in T and T'
respectively. Let the children of v be v_1, v_2, ..., v_p and the children of w be
w_1, w_2, ..., w_q. Let X be the set of leaves involved in an MCS T* of T_v and T'_w.
Note that T|X and T'|X will only include those children of v and w which have some
element(s) of X below them. Let A be the children of v included in T|X and B be the
children of w included in T'|X. (Note that X defines the sets A and B.)

Note also that any MCS of T and T' actually defines an agreement subset of some
binary refinement of T and some binary refinement of T' (Lemma 1). Hence, T* defines
a binary refinement at the node v if |A| > 1, and a binary refinement at the node w if
|B| > 1. In these cases, T* defines a partition of the nodes in A into two sets, and a
partition of the nodes in B into two sets.

There are four cases to consider: 
1. |A| = |B| = 1, i.e. A = {v_i} and B = {w_j} for some i and j. In this case, any
MCS of T_v and T'_w is an MCS of T_{v_i} and T'_{w_j}.

2. |A| = 1 and |B| > 1, i.e. any MCS of T_v and T'_w is an MCS of T_{v_i} and T'_w for
some i.

3. |A| > 1 and |B| = 1, i.e. any MCS of T_v and T'_w is an MCS of T_v and T'_{w_j} for
some j.

4. |A| > 1 and |B| > 1.

The analysis of the fourth case is somewhat complicated, and is the reason that we need
additional subproblems. Recall that T* defines a bipartition of A into (A', A - A') and
B into (B', B - B'). Further, recall that T* is a binary tree with two subtrees off the
root; we call these subtrees T_1 and T_2. It can then be observed that T_1 is an MCS of
the subtree of T_v obtained by restricting to the nodes below A - A' and the subtree
of T'_w obtained by restricting to the nodes below B - B'. Similarly, T_2 is an MCS
of the subtree of T_v obtained by restricting to the nodes below A' and the subtree of
T'_w obtained by restricting to the nodes below B'. Hence we need to define additional
subproblems as follows. For each A' ⊆ A define the tree T(v, A') to be the subtree of T_v
obtained by deleting all the children of v (and their descendants) not in A'. Similarly
define the tree T'(w, B') to be the subtree of T'_w obtained by deleting all the children
of w (and their descendants) not in B'. The construction of the tree T(v, A') is illustrated
in Figure 4. Now define MCS(v, A', w, B') to be the size of an MCS of T(v, A') and
T'(w, B'). (The size of a tree is taken to mean the number of leaves in the tree.)

Fig. 4. The Tree T, the Set A' and the Tree T(v, A').

From the above discussion it follows that MCS(v, A', w, B') is the maximum of:

- max{MCS(v, A', w_j, Children(w_j)) : w_j a child of w in B'},

- max{MCS(v_i, Children(v_i), w, B') : v_i a child of v in A'},

- max{MCS(v, A'', w, B'') + MCS(v, A' - A'', w, B' - B'') : A'' and B'' are
non-empty proper subsets of A' and B', respectively}.

The computation of these subproblems follows the obvious partial ordering, in which
MCS(v, A, w, B) must follow MCS(v', A', w', B') if both of the following conditions
hold:

- v lies above v' or [v = v' and A' ⊆ A],

- w lies above w' or [w = w' and B' ⊆ B].

The base cases, in which v is a leaf or w is a leaf, are easy to compute, and equal 1 or
0. For example, if v is a leaf then necessarily A = ∅, and so MCS(v, ∅, w, B) = 1 if
v ∈ T'(w, B), and 0 otherwise.

Running time analysis. There are O(2^d n) trees T(v, A), and hence O(2^{2d} n^2)
subproblems. The computation of MCS(v, A, w, B) involves computing the maximum
of 2d + 2^{2d} values, and hence takes O(2^{2d}) time. Hence the running time is O(2^{4d} n^2).
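The recurrence for two rooted trees can be sketched as a memoised recursion. This is a minimal sketch under stated assumptions, not the authors' code: trees are nested tuples with string leaf labels, a set of children is a frozenset of child subtrees, the first two terms range only over children that are kept in the current subsets, and all function names are hypothetical. Its running time is exponential in the degree, as in the analysis above.

```python
from functools import lru_cache


def children(t):
    """Children of a node as a frozenset (empty for a leaf)."""
    return frozenset() if isinstance(t, str) else frozenset(t)


def leaf_set(t, kept):
    """Leaves of the tree rooted at t when only the children in `kept` remain."""
    if isinstance(t, str):
        return {t}
    out = set()
    for child in kept:
        out |= leaf_set(child, children(child))
    return out


def proper_subsets(s):
    """All non-empty proper subsets of the frozenset s."""
    items = list(s)
    for mask in range(1, 2 ** len(items) - 1):
        yield frozenset(x for i, x in enumerate(items) if (mask >> i) & 1)


@lru_cache(maxsize=None)
def mcs(v, A, w, B):
    """Size of a maximum compatible subset of T(v, A) and T'(w, B)."""
    if isinstance(v, str):                         # base case: v is a leaf
        return 1 if v in leaf_set(w, B) else 0
    if isinstance(w, str):                         # base case: w is a leaf
        return 1 if w in leaf_set(v, A) else 0
    best = 0
    for wj in B:                                   # descend in the second tree
        best = max(best, mcs(v, A, wj, children(wj)))
    for vi in A:                                   # descend in the first tree
        best = max(best, mcs(vi, children(vi), w, B))
    for A2 in proper_subsets(A):                   # bipartition at both roots
        for B2 in proper_subsets(B):
            best = max(best, mcs(v, A2, w, B2) + mcs(v, A - A2, w, B - B2))
    return best


def mcs_size(t1, t2):
    return mcs(t1, children(t1), t2, children(t2))


star = ("a", "b", "c")            # unresolved tree
resolved = (("a", "b"), "c")      # one of its refinements
size = mcs_size(star, resolved)
```

In this small example all three leaves form a compatible set, because the second tree refines the first, whereas an agreement subset of the same two trees can contain only two leaves.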

3.3 Algorithm for the MCS Problem of k Rooted Trees with Bounded Degree 

We now show how to extend the analysis to k rooted trees. In this case, the subproblems
are 2k-tuples of the form MCS(v_1, A_1, v_2, A_2, ..., v_k, A_k) where v_i is a node in T_i
and A_i ⊆ Children(v_i). Hence there are O(2^{kd} n^k) subproblems. Computing each
subproblem involves taking the maximum of O(kd + 2^{kd}) values. Hence the running
time for the algorithm is O(2^{2kd} n^k).
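The bound for k rooted trees is the product of the number of subproblems and the cost per subproblem. Written out, with the elementary inequality kd ≤ 2^{kd} as the only step not spelled out in the text:

```latex
\begin{align*}
\underbrace{O\!\left(2^{kd}\, n^{k}\right)}_{\text{subproblems}}
\cdot
\underbrace{O\!\left(kd + 2^{kd}\right)}_{\text{values per subproblem}}
  &= O\!\left(2^{kd}\, n^{k}\right) \cdot O\!\left(2^{kd}\right)
  && \text{since } kd \le 2^{kd} \\
  &= O\!\left(2^{2kd}\, n^{k}\right).
\end{align*}
```

For k = 2 this gives O(2^{4d} n^2), which agrees with the two-tree analysis of Section 3.2.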

3.4 Extension to Unrooted Trees 

For each of the algorithms described, extending the algorithm to handle unrooted trees
involves rooting each of the trees at each of the n leaves (for each leaf all the trees are
rooted at that leaf), and then finding the best rooting. This is based on the observation
that given a leaf l, the size of an MCS of the trees rooted at l is the maximum size of a
compatible set for the unrooted trees that includes l. (While rooting the trees at the leaf
l, l itself must be excluded from the trees. Hence the size of a maximum compatible set
for the unrooted trees that includes l will actually be one more than the size of an MCS
of the trees rooted at l.) Since there are n leaves, this multiplies the running time by n.

Theorem 1. We can compute an MCS of k unrooted trees in which each tree has degree
at most d + 1 in O(2^{2kd} n^{k+1}) time.

4 Future Work 

To conclude we point out that many questions about the MCS problem remain unsolved. 
We know that MCS is NP-hard for 6 trees with unbounded degree, but we do not know 
the minimum number of trees for which MCS becomes hard. In particular, we do not 
know if MCS is NP-hard or polynomial for two trees. It also remains to be seen if there 
are any approximation algorithms for the problem, or exact algorithms when only some 
of the trees have bounded degree. 

Abstract. This paper discusses a bit-vector implementation of an algorithm that
computes an optimal sequence of reversals that sorts a signed permutation. The 
main characteristics of the implementation are its simplicity, both in terms of data 
structures and operations, and the fact that it exploits the parallelism of bitwise 
logical operations. 



1 Introduction 

For several good reasons, the problem of sorting signed permutations has received a lot
of attention in recent years. One of the attractive features of this problem is its simple
and precise formulation: Given a permutation π of integers between 1 and n, some of
which may have a minus sign,

π = (π_1 π_2 ... π_n),

find d(π), the minimum number of reversals that transform π into the identity
permutation:

(+1 +2 ... +n).

The reversal operation reverses the order of a block of consecutive elements in π, while
changing their signs.

Another good reason to study this problem is comparative genomics. The genome 
of a species can be thought of as a set of ordered sequences of genes — the ordering 
devices being the chromosomes — , each gene having an orientation given by its loca- 
tion on the DNA double strand. Different species often share similar genes that were 
inherited from common ancestors. However, these genes have been shuffled by muta- 
tions that modified the content of chromosomes, the order of genes within a particular 
chromosome, and/or the orientation of a gene. Comparing two sets of similar genes ap- 
pearing along a chromosome in two different species yields two (signed) permutations. 
It is widely accepted that the reversal distance between these two permutations, that is, 
the minimum number of reversals that transform one into the other, faithfully reflects 
the evolutionary distance between the two species. 

The last, and probably the best feature of the sorting problem from an algorithmic
point of view, is the dramatic increase in efficiency its solution exhibited in a few years.
From a problem of unknown complexity in the early nineties, a time when approximate
solutions were plenty, polynomial solutions of constantly decreasing degree were
successively found: O(n^4), O(n^2 α(n)), O(n^2), and finally O(n), for the
distance computation alone. A computer scientist's delight, knowing that the unsigned
version was proved to be NP-hard.

This high level of scientific activity inevitably generated undesirable side-effects.
Complex constructions, useful in the initial investigations, were later proven
unnecessary. Terminology is not yet standard. For example, the overlap graphs used by
different authors are not the same. Most importantly, the complexity measures tend to mix
two different problems: the computation of the number of necessary reversals, and the
reconstruction of one possible sequence of reversals that realizes this number.

The first problem has an efficient and simple linear solution that can hardly
be improved on. In this paper, we address the second problem with elementary tools,
further developing ideas from earlier work. We show that, with any problem of biologically
relevant size, it is possible to implement efficient algorithms using the simplest data
structures and operations: in this case, bit-vectors and standard logical and arithmetic
operations.

The next section presents a brief introduction to the current theory. It is followed by 
a discussion of the implementation, and results on simulated biological data. 

2 Sorting by Reversals 

The basic construction used for computing d(π), the reversal distance of a signed
permutation π, is the cycle graph associated with π. Each positive element x in the
permutation π is replaced by the sequence 2x-1 2x, and each negative element -x by
the sequence 2x 2x-1. Integers 0 and 2n+1 are added as first and last elements. For
example, the permutation

π = ( -2 -1 +4 +3 +5 +8 +7 +6 )

becomes

π' = ( 0 4 3 2 1 7 8 5 6 9 10 15 16 13 14 11 12 17 ).

The elements of π' are the vertices of the cycle graph. Edges join every other pair of
consecutive elements of π', starting with 0, and every other pair of consecutive integers,
starting with (0,1). The first group of edges, the horizontal ones, is often referred to as
black edges, and the second as arcs or gray edges.

Every connected component of the cycle graph is a cycle, which is a consequence 
of the fact that each vertex has exactly two incident edges. The graph of Figure 1 has 4 
cycles. 
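The construction of π' and the cycle count can be checked with a short sketch. The list representation and the function names extend and count_cycles are assumptions made for illustration; the gray partner of a vertex is obtained by flipping its lowest bit, since arcs join 2i and 2i+1.

```python
def extend(perm):
    """Build the sequence pi' from a signed permutation given as a list of ints."""
    out = [0]
    for x in perm:
        out += [2 * x - 1, 2 * x] if x > 0 else [-2 * x, -2 * x - 1]
    out.append(2 * len(perm) + 1)
    return out


def count_cycles(perm):
    """Number of cycles of the cycle graph of a signed permutation."""
    ext = extend(perm)
    black = {}
    for i in range(0, len(ext), 2):        # black edges: every other adjacent pair
        a, b = ext[i], ext[i + 1]
        black[a], black[b] = b, a
    seen, cycles = set(), 0
    for start in ext:
        if start in seen:
            continue
        cycles += 1
        v = start
        while v not in seen:               # alternate black edge, then gray arc
            seen.add(v)
            u = black[v]
            seen.add(u)
            v = u ^ 1                      # arc partner: 2i <-> 2i+1
    return cycles


pi = [-2, -1, 4, 3, 5, 8, 7, 6]
pi_prime = extend(pi)
c = count_cycles(pi)
```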

The support of an arc is the interval of elements of π' between, and including,
its endpoints. Two arcs overlap if their supports intersect, without proper containment.
An arc is oriented if its support contains an odd number of elements, otherwise it is
unoriented. Note that an arc is oriented if and only if its endpoints belong to elements
with different signs in the original permutation.

Note: A cycle graph in which all cycles of length two have been removed is often called
a breakpoint graph.
The arc overlap graph is the graph whose vertices are arcs of the cycle graph, and
whose edges join overlapping arcs. The overlap graph corresponding to the cycle graph
of Figure 1 is illustrated in Figure 2, in which each vertex is labeled by an arc (2i, 2i+1).
Oriented vertices, those for which the corresponding arc is oriented, are marked by
black dots. Orientation extends to connected components in the sense that a connected
component with at least one oriented vertex is oriented. It is easy to show that a vertex
is oriented if and only if its degree is odd.
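The definitions of support, overlap and orientation translate directly into code. The sketch below is illustrative only: it takes the sequence π' of the running example as a plain list, numbers the arc (2i, 2i+1) as vertex i, and uses the hypothetical function name arc_overlap_graph. It also lets one check, on this example, the stated equivalence between orientation and odd degree.

```python
def arc_overlap_graph(ext):
    """Return (adjacency sets, orientation flags) for the arcs of pi'."""
    pos = {value: i for i, value in enumerate(ext)}
    n_arcs = len(ext) // 2
    support = []
    for i in range(n_arcs):                # arc i joins the integers 2i and 2i+1
        a, b = pos[2 * i], pos[2 * i + 1]
        support.append((min(a, b), max(a, b)))
    adj = [set() for _ in range(n_arcs)]
    for i in range(n_arcs):
        for j in range(i + 1, n_arcs):
            (a, b), (c, d) = support[i], support[j]
            if a < c < b < d or c < a < d < b:   # intersect without containment
                adj[i].add(j)
                adj[j].add(i)
    # An arc is oriented when its support contains an odd number of elements.
    oriented = [(b - a + 1) % 2 == 1 for a, b in support]
    return adj, oriented


ext = [0, 4, 3, 2, 1, 7, 8, 5, 6, 9, 10, 15, 16, 13, 14, 11, 12, 17]
adj, oriented = arc_overlap_graph(ext)
parity_matches = all((len(adj[i]) % 2 == 1) == oriented[i] for i in range(len(adj)))
```

On this example the vertices (10, 11) to (16, 17) form a connected component without any oriented vertex.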



Fig. 2. The Arc Overlap Graph of π.



A hurdle is an unoriented component of the arc overlap graph whose vertices
are consecutive on the circle, except for isolated points. The graph of Figure 2
has one hurdle spanning the vertices (10, 11) to (16, 17).

Hannenhalli and Pevzner have shown that the reversal distance of a signed
permutation of length n is given by the formula:

d(π) = n + 1 - c + h + f

where c is the number of cycles in the cycle graph, h is the number of hurdles, and f
is a correction factor equal to 1 when there are at least 3 hurdles satisfying a particular
condition. With the above formula, it is easy to compute the reversal distance of the
permutation of Figure 1 and Figure 2 as 9 - 4 + 1 = 6.

Note: The cycle overlap graph is obtained in a similar way, by defining two cycles to
overlap if they have at least two overlapping arcs.



Experiments in Computing Sequences of Reversals 167 



Computing the distance can be done in linear time. However, this number
alone gives no clue to the reconstruction of a possible sequence of reversals that realizes
d(π).

Reconstructing a possible sequence of reversals raises two different problems. The
first one is how to deal with unoriented components. This problem is solved by carefully
choosing a sequence of reversals that merges these components while creating at least
one oriented vertex in each component. The selection of these reversals can
be done efficiently while computing d(π).

The second problem, called the oriented sort, requires choosing, among several
candidates, a safe reversal, that is, a reversal that decreases the reversal distance. Such a
reversal always exists, but can be hard to find. For example, choosing the obvious
reversal of the first two elements in the permutation:

( -2 -1 +4 +3 )

for which d = 3, yields the permutation

( +1 +2 +4 +3 )

that still has d = 3, since one hurdle was created by the reversal. On the other hand, the
original permutation can be sorted by the sequence:

( -2 -1 +4 +3 )
( -4 +1 +2 +3 )
( -4 -3 -2 -1 )
( +1 +2 +3 +4 )
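The reversal operation itself is a one-liner, and the two sequences above can be replayed with it. The list representation, the 0-based inclusive indices and the function name reverse_block are assumptions made for this sketch.

```python
def reverse_block(perm, i, j):
    """Reverse perm[i..j] (0-based, inclusive) and change the signs in the block."""
    return perm[:i] + [-x for x in reversed(perm[i:j + 1])] + perm[j + 1:]


start = [-2, -1, 4, 3]

# The "obvious" reversal of the first two elements.
obvious = reverse_block(start, 0, 1)

# The three-step sorting sequence.
step1 = reverse_block(start, 0, 2)
step2 = reverse_block(step1, 1, 3)
step3 = reverse_block(step2, 0, 3)
```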

In general, to any oriented vertex of the overlap graph, there is an associated reversal
that creates consecutive elements in the permutation, and the search for safe reversals
can be restricted to reversals associated to oriented vertices. Several criteria have been
proposed to select safe reversals. In one approach, the selection of a safe reversal is the
most expensive iteration of the algorithm; another reduces the complexity of the search
by considering only O(log n) candidates; and a third gives a characterization of a safe
reversal in terms of cliques. In earlier work, we showed that there is a much simpler way
to identify a safe reversal:

Theorem 1. The reversal that maximizes the number of oriented vertices is safe.

In the following sections we discuss a bit-vector implementation of the oriented sort
that uses this property.

3 Implementing the Oriented Sort with Bit-Vectors

The idea underlying the sorting algorithm is that the reversal corresponding to an
oriented vertex v modifies the overlap graph as follows: it isolates v, complements the
subgraph of vertices adjacent to v, and reverses their orientation. Therefore, the net
change in the number of oriented vertices depends only on the orientation of vertices
adjacent to v.

The score of a reversal associated to vertex v is defined by the difference between
the number of its unoriented neighbors U_v, and the number of its oriented neighbors
O_v. The score of a reversal is a “local” property of the graph, and this locality suggests
the possibility of a parallel algorithm to keep the scores and to compute the effects of a
reversal.

For a signed permutation of length n, we will denote by bold letters the characteristic
bit-vectors of subsets of the n + 1 arcs (0, 1) to (2n, 2n + 1). We will use only
three operations on these vectors: the exclusive-or operator ⊕; the conjunction ∧; and
the negation ¬.

3.1 The Data Structure 

Given an overlap graph, we construct a bit-matrix in which each line v_i is the set of
adjacent vertices to arc (2i, 2i+1). For example, the bit-matrix associated to the overlap
graph of permutation ( 0 +3 +1 +6 +5 -2 +4 +7 ) is the following:

       v0  v1  v2  v3  v4  v5  v6
  v0    0   0   1   1   0   0   0
  v1    0   0   1   0   1   0   1
  v2    1   1   0   1   1   0   1
  v3    1   0   1   0   1   0   1
  v4    0   1   1   1   0   1   0
  v5    0   0   0   0   1   0   1
  v6    0   1   0   1   0   1   0
  p     0   1   1   0   0   0   0
  s     0   1   3   2   0   2   0



The last two lines contain, respectively, the parity, or orientation, of the vertex, 
and the score of the associated reversal. We will discuss efficient ways to initialize the 
structure and to adjust scores in Sections 3.2 and 3.3. 

Given the vectors p and s, selecting the oriented reversal with maximal score is 
elementary. In the above example, vertex 2 would be the selected candidate. 





Experiments in Computing Sequences of Reversals 



The interesting part is how a reversal affects the structure. These effects are summa- 
rized in the following algorithm, which recalculates the bit-matrix v, the parity vector 
p, and the score vector s, following the reversal associated to vertex i, whose set of 
adjacent vertices is denoted by Vi. 



    s ← s + v_i 
    v_ii ← 1 
    For each vertex j adjacent to i 
        If j is oriented 
            s ← s + v_j 
            v_jj ← 1 
            v_j ← v_j ⊕ v_i 
            s ← s + v_j 
        Else 
            s ← s − v_j 
            v_jj ← 1 
            v_j ← v_j ⊕ v_i 
            s ← s − v_j 
    p ← p ⊕ v_i 


The logic behind the algorithm is the following. Since vertex i will become unori- 
ented and isolated, each vertex adjacent to i will automatically gain a point of score. 
Next, if j is a vertex adjacent to i, vertices adjacent to j after the reversal are either 
existing vertices that were not adjacent to i, or vertices that were adjacent to i but not 
to j. The exceptions to this rule are i and j themselves, and this problem is solved by 
setting the diagonal bits to 1 before computing the direct sum. 

If j is oriented, each of its former adjacent vertices will gain one point of score, 
since j will become unoriented, and each of its new adjacent vertices will gain one 
point of score. Note that a vertex that stays connected to j will gain a total of two 
points. For unoriented vertices, the gains are converted to losses. 

The amount of work done to process a reversal corresponding to vertex i, in terms 
of vector operations, is thus proportional to the number of adjacent vertices to vertex i. 
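
As an illustration, the update can be sketched in Python, with arbitrary-precision integers standing in for the bit-vectors. This is a minimal sketch, not the authors' C code: the names `apply_reversal`, `scores_from_graph`, `mask`, `ROWS` and `PARITY` are hypothetical, the scores are kept in a plain list rather than in a bit-matrix, and clearing the row of the isolated vertex at the end is an added assumption. The adjacency sets are those of the overlap graph of the example permutation ( 0 +3 +1 +6 +5 −2 +4 +7 ).

```python
def apply_reversal(v, p, s, i):
    """Update rows v, parity mask p and scores s for the reversal at oriented vertex i.

    v is a list of Python ints used as bit-vectors (bit k of v[j] is set when
    vertices j and k are adjacent), p is an int whose set bits mark oriented
    vertices, and s is a plain list of integer scores. Returns the new parity mask.
    """
    n = len(v)

    def add(mask_bits, delta):
        # Component-wise s <- s + delta * mask_bits.
        for k in range(n):
            if (mask_bits >> k) & 1:
                s[k] += delta

    neighbours = [j for j in range(n) if (v[i] >> j) & 1]
    add(v[i], +1)                  # i stops counting as an oriented neighbour
    v[i] |= 1 << i                 # diagonal bit of row i
    for j in neighbours:
        delta = 1 if (p >> j) & 1 else -1   # gains if j is oriented, losses otherwise
        add(v[j], delta)           # withdraw j's contribution to its old neighbours
        v[j] |= 1 << j             # diagonal bit of row j
        v[j] ^= v[i]               # complement inside N(i); also clears bits i and j
        add(v[j], delta)           # contribution of j to its new neighbours
    p ^= v[i]                      # flip the orientation of i and of its neighbours
    v[i] = 0                       # assumption: clear the row of the isolated vertex
    return p


def scores_from_graph(v, p):
    """Recompute every score directly: unoriented neighbours minus oriented ones."""
    n = len(v)
    return [sum(-1 if (p >> k) & 1 else 1 for k in range(n) if (v[j] >> k) & 1)
            for j in range(n)]


def mask(vertices):
    """Bit-vector (int) with one bit set per listed vertex."""
    return sum(1 << k for k in vertices)


# Adjacency sets of v0..v6 for the example permutation, and its oriented vertices.
ROWS = [mask(a) for a in ({2, 3}, {2, 4, 6}, {0, 1, 3, 4, 6}, {0, 2, 4, 6},
                          {1, 2, 3, 5}, {4, 6}, {1, 2, 3, 5})]
PARITY = mask({1, 2})
```

Applying `apply_reversal` to vertex 2 of this example and comparing the result with `scores_from_graph` is a convenient sanity check of the incremental update; note that scores can become negative after a reversal.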



3.2 Representing the Scores 

The additions and subtractions to adjust the score vector are the usual arithmetic 
operations performed component-wise. In order to have a truly bit-vector implementation, 
we represented the score vector as a ⌈log(n)⌉ × n bit-matrix, each column containing 
the binary representation of a score. With this representation, component-wise addition 
of a bit-vector v to s can be realized with the following: 

    For k from 1 to ⌈log(n)⌉ 
        t ← v 
        v ← v ∧ s_k 
        s_k ← t ⊕ s_k 

Subtraction is implemented in a similar way. A side benefit of this structure is that 
the selection of the next reversal can also be done in parallel, by “sifting” the score 
matrix through the parity vector. The set c of candidates initially contains all the 
oriented vertices. Going from the higher bit of the scores to the lower, if at least one of the 
candidates has bit i set to 1, we eliminate all candidates for which bit i is 0. 

    c ← p 
    i ← ⌈log(n)⌉ 
    While i > 0 do 
        While (c ∧ s_i) = 0 
            i ← i − 1 
        If i > 0 
            c ← c ∧ s_i 
            i ← i − 1 

At the end of the loop, c is the set of oriented vertices of maximal score. 
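
The two loops of this section can be sketched as follows, again with Python integers as bit-vectors. The sketch makes several assumptions that are not in the text: the function names are hypothetical, `planes[0]` holds the least significant bit of every score, scores are non-negative, and a carry out of the top plane is silently dropped. The sifting loop is written as a single descending pass, which selects the same candidates as the nested loop above.

```python
def add_vector(planes, v):
    """Add bit-vector v (an int) component-wise to the bit-sliced scores.

    planes[k] is the bit-vector of the k-th binary digit of all scores.
    """
    carry = v
    for k in range(len(planes)):
        t = carry
        carry = carry & planes[k]      # carry into the next plane
        planes[k] = t ^ planes[k]      # sum bit for this plane


def best_candidates(planes, p):
    """Return the oriented vertices (bits of p) that have maximal score."""
    c = p
    for k in reversed(range(len(planes))):    # from the high bit down
        if c & planes[k]:
            c &= planes[k]             # keep only candidates with this bit set
    return c


def to_planes(scores, width):
    """Helper (not in the text): pack non-negative scores into `width` planes."""
    return [sum(((x >> k) & 1) << j for j, x in enumerate(scores))
            for k in range(width)]


def from_planes(planes, n):
    """Helper (not in the text): unpack the bit-sliced matrix into n scores."""
    return [sum(((planes[k] >> j) & 1) << k for k in range(len(planes)))
            for j in range(n)]
```

With the scores (0, 1, 3, 2, 0, 2, 0) and the parity vector of the example of Section 3.1, `best_candidates` keeps only vertex 2.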

3.3 Initializing the Data Structure 

We saw, in Section 2, that the overlap graph of a signed permutation $\pi = (\pi_1\ \pi_2 \ldots \pi_n)$ 
contains n + 1 vertices corresponding to the arcs joining 2i and 2i + 1 in the equivalent 
unsigned permutation. In this section, we will construct a representation of the overlap 
graph without explicitly referring to the unsigned permutation, thus removing one more 
step between the actual algorithm and the original formulation of the problem. 

The construction is based on the following simple lemma. Let I be a set of intervals 
with distinct endpoints in an ordered set S. Let $i = (b_i, e_i)$ and $j = (b_j, e_j)$ be intervals 
in I. Define the sets $l_i$ and $r_i$ as follows: 

    $l_i = \{ j \in I \mid b_j < e_i < e_j \}$ 

    $r_i = \{ j \in I \mid b_j < b_i < e_j \}$ 

We have the following: 

Lemma 1. The set $v_i$ of intervals that overlap i in I is given by: $v_i = l_i \oplus r_i$. 

Starting with a signed permutation $\pi = (\pi_1\ \pi_2 \ldots \pi_n)$, we first read the elements 
from left to right. Let a represent the set of arcs (i.e., vertices) for which exactly one 
endpoint has been read. Initially, a is the set {0}, corresponding to the arc (0, 1). When 
element $\pi_i$ is read, we have to process two arcs, $(2|\pi_i| - 2,\ 2|\pi_i| - 1)$ and 
$(2|\pi_i|,\ 2|\pi_i| + 1)$, in increasing order if $\pi_i$ is positive, and in decreasing order 
otherwise. Processing an arc (2j, 2j + 1) is done by the following instructions: 

    If a_j = 0 
        Then a_j ← 1        (* First endpoint of arc (2j, 2j + 1) *) 
        Else                (* Second endpoint *) 
            a_j ← 0 
            v_j ← a         (* a is the set l_j *) 

We then repeat the process in the reverse order, reading the permutation from right 
to left, initializing a to the set {n}, and changing the last instruction to v_j ← v_j ⊕ a. 
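
The two sweeps can be sketched in Python as follows. This is only an illustration under stated assumptions: `overlap_rows` is a hypothetical name, bit-vectors are Python integers, every row starts at zero so that both sweeps can use the exclusive-or, and the right-to-left sweep also visits the two arcs of each element in the opposite order.

```python
def overlap_rows(perm):
    """Rows of the overlap graph of a signed permutation, as int bit-vectors.

    perm lists the signed elements pi_1..pi_n without the framing elements
    0 and n + 1. Row j describes the arc (2j, 2j + 1), for j = 0..n.
    """
    n = len(perm)
    rows = [0] * (n + 1)

    def sweep(elements, start, left_to_right):
        a = 1 << start                        # arcs with exactly one endpoint read
        for x in elements:
            low, high = abs(x) - 1, abs(x)    # the two arcs touched by element x
            increasing = (x > 0) == left_to_right
            for j in ((low, high) if increasing else (high, low)):
                if not ((a >> j) & 1):
                    a |= 1 << j               # first endpoint of arc j
                else:
                    a &= ~(1 << j)            # second endpoint of arc j
                    rows[j] ^= a              # l_j on the first sweep, then xor r_j

    sweep(perm, 0, True)
    sweep(reversed(perm), n, False)
    return rows
```

For the permutation (+3 +1 +6 +5 −2 +4), `overlap_rows([3, 1, 6, 5, -2, 4])` returns the seven rows of the example of Section 3.1.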
4 Analysis and Performances 

The formal analysis of the algorithm of Section 3 raises interesting questions. For ex- 
ample, what is an elementary operation? Except for a few control statements, the only 
operations used by the algorithm are very efficient bit-wise logical operators on words 
of size w — typically 32 or 64, depending on implementation. The most expensive in- 
structions in the main loop are additions and subtractions, such as 

    s ← s + v_j, 

where s is a bit matrix of size n log(n), and $v_j$ is a bit vector of size n. Such an 
operation requires a total of (2n log(n))/w elementary operations with the loop described in 
Section 3.2. Fortunately, log(n) is much smaller than w, and, in the range of biologically 
meaningful values, n is often a small multiple of w. In the actual implementation, the 
loop is controlled by the value of log(maximal score), which tends to be much less than 
log(n). We thus have a very generous O(n) estimate for the instructions in the main 
loop. 

The overall work done by the algorithm depends on the total number v of vertices 
adjacent to vertices of maximal score. We can easily bound it by $n^2$, noting that the 
number d of reversals needed to sort the permutation is bounded by n, and the degree 
of a vertex is also bounded by n. With O(n) instructions per adjacent vertex, we thus get 
an $O(n^3)$ estimate for the algorithm, assuming that log n < w. However, experimental 
results suggest that v is better estimated by O(n), at least for values of n up to 4096, 
which seems largely sufficient for most biological applications. 

The value of v is hard to control experimentally, but if we write the quantity governing 
the running time as $d\,v_m\,n$, in which $v_m$ is the mean number of adjacent vertices, 
then both d and n can be fixed in an independent way. 

4.1 Experimental Setup 

In the following experiments, we generated sets of simulated genomes of various lengths 
n, by applying k random reversals to the identity permutation. The parameter k is of- 
ten called the evolution rate. The — almost nonexistent — permutations that contained an 
oriented component were rejected from the compilations. 

The code is written in C, and is quite “bare”. For example, loops controlled by 
bit-vectors are implemented with right shifts and tests of bit values. The tests were 
conducted on an 800MHz Pentium 3. 



4.2 Speed and Range 

The first observation is that the implementation is very fast for typical biological 
problems. With n = 128 and k = 32, we can compute 10,000 sequences of reversals in 11 s. 
On the other end of the spectrum, we applied the algorithm to values up to n = 4096. 
In this case, the computation of reversal sequences for k = 512 took a mean time of 
3.86 s for 100 random permutations. 




Anne Bergeron and François Strasbourg 



4.3 Effects of the Variation of n 

In order to study the effect of the variation of n on the running time, we chose four 
different evolution rates, k = 32, 128, 256, and 512. For each value of k, we generated 
sets of 100 permutations of length ranging from 256 to 4,096. Figure 3 displays the 
results for the mean sorting time for k = 256 and k = 512; values for smaller k were 
too low for a significant analysis. 




Fig. 3. Mean Time to Sort a Permutation of Length n (in sec.). 



In this range of values, the behavior of the algorithm is clearly linear. Recall from 
the analysis that the estimated running time is governed by the quantity $d\,v_m\,n$. With k 
constant and n sufficiently large, d is also constant. It seems that for large n, the 
shape of the overlap graph depends only on d, which would certainly be true in the 
limit case of a continuous interval. 



4.4 Effects of the Variation of d 

In this series of experiments, we studied the effect of the variation of d, the number of 
reversals, for a fixed n = 1,024. We generated sets of 500 permutations with evolution 
rates varying from k = 64 to k = 1,024, with equal increments. 

For each set, we computed the mean number of reversals. Figure 4 presents the pairs 
of mean values (d, t). The fact that the points are closer together in the right part of the 
graph is called saturation: when k grows close to n, the value of d tends to stabilize. 

Fig. 4. Mean Time vs. Number of Reversals. 

At least for the studied range of values, the performance of the algorithm on the 
value of d, for a fixed n, seems to be much less than which is a bit surprising, 
given that what is measured here is $d\,v_m$. Factoring out d from the data yields the curve 
of Figure 5, which appears to be 0(log^(fi)). 




Fig. 5. (Mean Time)/(Number of Reversals) vs. Number of Reversals. 
5 Conclusions 

Our goal is eventually to be able to study combinatorial properties of optimal sequences 
of reversals, and our algorithm has the power and performance to serve as a basic tool in 
such studies. But we think that one of the most desirable features of the algorithm is its 
simplicity: the basic ideas can be implemented using only arrays and vector operations. 

The experimental running times are still a bit of a puzzle. The theoretical $O(n^3)$ 
seems greatly overestimated for mean performances. The role of the parameter $v_m$ is 
still under study. 

In one experiment, we generated two sets of 50,000 permutations of length n = 256, 
with evolution rate k = 64. In the first group, we restricted the length of the random 
reversals to less than (1/3)n, and in the second group, to more than (1/3)n, under the 
hypothesis that shorter reversals would produce a less dense overlap graph. Indeed, the 
“short” group was sorted in 85% of the time of the “long” group. 
Abstract. Evolution operates on whole genomes by operations that change the 
order and strandedness of genes within the genomes. This type of data presents 
new opportunities for discoveries about deep evolutionary rearrangement events, 
provided that sufficiently accurate methods can be developed to reconstruct 
evolutionary trees in these models. A necessary component of any such 
method is the ability to accurately estimate the true evolutionary distance 
between two genomes, which is the number of rearrangement events that took place 
in the evolutionary history between them. We improve on the IEBP technique 
with a new method, Exact-IEBP, for estimating the true evolutionary 
distance between two signed genomes. Our simulation study shows that Exact-IEBP is 
a better estimator of true evolutionary distances. Furthermore, Exact-IEBP 
produces more accurate trees than IEBP when used with the popular distance-based 
method, neighbor joining. 



1 Introduction 

Genome Rearrangement Evolution. The genomes of some organisms have a single 
chromosome or contain single chromosome organelles (such as mitochondria or 
chloroplasts) whose evolution is largely independent of the evolution of the 
nuclear genome for these organisms. Many single-chromosome organisms and organelles 
have circular chromosomes. Gene maps and whole genome sequencing projects can 
provide us with information about the ordering and strandedness of the genes, so the 
chromosome is represented by an ordering (linear or circular) of signed genes (where 
the sign of the gene indicates which strand it is located on). The evolutionary process 
on the chromosome can thus be seen as a transformation of signed orderings of genes. 
The process includes inversions, transpositions, and inverted transpositions, which we 
will define later. 

True Evolutionary Distances. Let T be the true tree on which a set of genomes has 
evolved. Every edge e in T is associated with a number $k_e$, the actual number of 
rearrangements along edge e. The true evolutionary distance (t.e.d.) between two leaves $G_i$ 
and $G_j$ in T is $k_{ij} = \sum_{e \in P_{ij}} k_e$, where $P_{ij}$ is the simple path on T between 
$G_i$ and $G_j$. If we can estimate all $k_{ij}$ sufficiently accurately, we can reconstruct the tree T using 
very simple methods, and in particular, using the neighbor joining method (NJ). 
Estimates of pairwise distances that are close to the true evolutionary distances will in 
general be more useful for evolutionary tree reconstruction than edit distances, because 
edit distances underestimate true evolutionary distances, and this underestimation can 
be very significant as the number of rearrangements increases. 

There are two criteria for evaluating a t.e.d. estimator: how close the estimated dis- 
tances are to the true evolutionary distance between two genomes, and how accurate 
the inferred trees are when a distance-based method (e.g. neighbor joining) is used in 
conjunction with these distances. The importance of obtaining good t.e.d. estimates 
when analyzing DNA sequences (under stochastic models of DNA sequence evolution) 
is understood and well-studied. 

Representations of Genomes. If we assign a number to the same gene in each genome, 
a linear genome can be represented by a signed permutation of {1, . . . , n} — a permu- 
tation followed by giving each number a plus or minus sign — where the sign shows 
which strand the gene is on. A circular genome can be represented the same way as 
a linear genome by breaking off the circle between two neighboring genes and choos- 
ing the clockwise or counter-clockwise direction as the positive direction. For exam- 
ple, the following are representations for the same circular genome: (1, 2, 3), (2, 3, 1), 
(—1, —3, —2). The canonical representation for a circular genome is the representation 
where gene 1 is at the first position with positive sign. The first representation in the 
previous example is the canonical representation. 

The Generalized Nadeau-Taylor Model. We are particularly interested in the following 
three types of rearrangements: inversions, transpositions, and inverted transpositions. 
Starting with a genome $G = (g_1, g_2, \ldots, g_n)$, an inversion between indices a and b, 
$1 \le a < b \le n+1$, produces the genome with linear ordering 

$(g_1, g_2, \ldots, g_{a-1}, -g_{b-1}, \ldots, -g_a, g_b, \ldots, g_n)$ 

If $b < a$, we can still apply an inversion to a circular (but not linear) genome by simply 
rotating the circular ordering until $g_a$ precedes $g_b$ in the representation; we consider 
all rotations of the complete circular ordering of a circular genome as equivalent. A 
transposition on the (linear or circular) genome G acts on three indices, a, b, c, with 
$1 \le a < b \le n$ and $2 \le c \le n+1$, $c \notin [a, b]$, and operates by picking up the interval 
$g_a, g_{a+1}, \ldots, g_{b-1}$ and inserting it immediately after $g_{c-1}$. Thus the genome G above 
(with the additional assumption of $c > b$) is replaced by 

$(g_1, \ldots, g_{a-1}, g_b, g_{b+1}, \ldots, g_{c-1}, g_a, g_{a+1}, \ldots, g_{b-1}, g_c, \ldots, g_n)$ 

An inverted transposition is the combination of a transposition and an inversion on the 
transposed subsequence, so that G is replaced by 

$(g_1, \ldots, g_{a-1}, g_b, g_{b+1}, \ldots, g_{c-1}, -g_{b-1}, -g_{b-2}, \ldots, -g_a, g_c, \ldots, g_n)$ 

The Generalized Nadeau-Taylor (GNT) model assumes a phylogeny (i.e. a 
rooted binary tree leaf-labeled by species) and models inversions, transpositions, and 
inverted transpositions along the edges. Different inversions have equal probability, and 
so do different transpositions and inverted transpositions. Each model tree has two 
parameters: $\alpha$ is the probability a rearrangement event is a transposition, and $\beta$ is the 
probability a rearrangement event is an inverted transposition. Hence, the probability 
for a rearrangement event to be an inversion is $1 - \alpha - \beta$. The number of events on each 
edge e is Poisson distributed with mean $\lambda_e$. This process produces a set of signed gene 
orders at the leaves of the model tree. 
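
The three rearrangement types can be sketched on linear genomes stored as Python lists of signed integers. The function names are hypothetical, the indices are 1-based as in the definitions above, and the two transposition variants assume c > b; the circular case, which requires rotating the ordering first, is not covered by this sketch.

```python
def inversion(g, a, b):
    """Invert g_a..g_{b-1} (1-based indices, 1 <= a < b <= n + 1)."""
    return g[:a - 1] + [-x for x in reversed(g[a - 1:b - 1])] + g[b - 1:]


def transposition(g, a, b, c):
    """Move g_a..g_{b-1} so that it immediately follows g_{c-1} (assumes c > b)."""
    return g[:a - 1] + g[b - 1:c - 1] + g[a - 1:b - 1] + g[c - 1:]


def inverted_transposition(g, a, b, c):
    """Transpose g_a..g_{b-1} after g_{c-1} and invert it (assumes c > b)."""
    moved = [-x for x in reversed(g[a - 1:b - 1])]
    return g[:a - 1] + g[b - 1:c - 1] + moved + g[c - 1:]
```

For instance, `inversion([1, 2, 3, 4], 2, 4)` gives `[1, -3, -2, 4]`.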
IEBP. The IEBP (Inverting the Expected Breakpoint distance) method estimates 
the true evolutionary distance by approximating the expected breakpoint distance (see 
Section 2) under the GNT model with provable error bound. The method can be applied 
to any dataset of genomes with equal gene content, and for any relative probabilities of 
rearrangement event classes. Moreover, the method is robust when the assumptions 
about the model parameters are wrong. 

EDE. In the EDE (Empirically Derived Estimator) method we estimate the true 
evolutionary distance by inverting the expected inversion distance. We estimate the 
expected inversion distance by a nonlinear regression on simulation data. The evolutionary 
model in the simulation is inversion only, but NJ using EDE distance has very good 
accuracy even when transpositions and inverted transpositions are present. 

Our New t.e.d. Estimator. In this paper we improve on the IEBP result by introducing the 
Exact-IEBP method. The method replaces the approximation in the IEBP method by 
computing the expected breakpoint distance exactly. In Section 3 we show the 
derivation for our new method. The technique is then checked by computer simulations in 
Section 4. The simulation shows that the new method is the best t.e.d. estimator, and 
the accuracy of the NJ tree using the new method is comparable to that of the NJ tree 
using the EDE distances, and better than that of the NJ tree using other distances. 



2 Definitions 

We first define the breakpoint distance between two genomes. Let genome $G_0 = 
(g_1, \ldots, g_n)$, and let G be a genome obtained by rearranging $G_0$. The two genes $g_i$ and 
$g_j$ are adjacent in genome G if $g_i$ is immediately followed by $g_j$ in G, or, equivalently, 
if $-g_j$ is immediately followed by $-g_i$. A breakpoint in G with respect to $G_0$ is defined 
as an ordered pair of genes $(g_i, g_j)$ such that $g_i$ and $g_j$ are adjacent in $G_0$, but are not 
adjacent in G (neither $(g_i, g_j)$ nor $(-g_j, -g_i)$ appear consecutively in that order in G). 
The breakpoint distance between two genomes G and $G_0$ is the number of breakpoints 
in G with respect to $G_0$ (or vice versa, since the breakpoint distance is symmetric). For 
example, let G = (1, 2, 3, 4) and let G' = (1, −3, −2, 4); there is a breakpoint between 
genes 1 and 3 in G' (w.r.t. G) but genes 2 and 3 are adjacent in G' (w.r.t. G). The 
breakpoint distance between two genomes is the number of breakpoints in one genome 
with respect to the other. 
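
The definition translates directly into a short Python sketch. The function name and the `circular` flag are hypothetical; for linear genomes the sketch frames both gene orders with the sentinel genes 0 and n + 1 that are introduced later in this section.

```python
def breakpoint_distance(g0, g, circular=False):
    """Number of adjacencies of g0 that are missing from g (signed gene orders)."""
    def pairs(genome):
        genome = list(genome)
        if circular:
            return list(zip(genome, genome[1:] + genome[:1]))
        framed = [0] + genome + [len(genome) + 1]   # sentinel genes 0 and n + 1
        return list(zip(framed, framed[1:]))

    present = set()
    for x, y in pairs(g):
        present.add((x, y))
        present.add((-y, -x))        # the same adjacency read on the other strand
    return sum(1 for pair in pairs(g0) if pair not in present)
```

On the example above, `breakpoint_distance([1, 2, 3, 4], [1, -3, -2, 4])` counts the two lost adjacencies (1, 2) and (3, 4).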

A rearrangement $\rho$ is a permutation of the genes in the genome, followed by either 
negating or retaining the sign of each gene. For any genome G, let $\rho G$ be the genome 
obtained by applying $\rho$ on G. Let $R_I$, $R_T$, $R_V$ be the set of all inversions, transpositions, 
and inverted transpositions, respectively. We assume the evolutionary model is the GNT 
model with parameters $\alpha$ and $\beta$. Within each of the three types of rearrangement events, 
all events have the same probability. 

Let $G_0 = (g_1, g_2, \ldots, g_n)$ be the signed genome of n genes at the beginning of the 
evolutionary process. For linear genomes we add the two sentinel genes 0 and n + 1 at 
the front and the end of $G_0$ that are never moved. For any $k \ge 1$, let $\rho_1, \rho_2, \ldots, \rho_k$ be k 
random rearrangements and let $G_k = \rho_k \rho_{k-1} \cdots \rho_1 G_0$ (i.e. $G_k$ is the result of applying 
these k rearrangements to $G_0$). Given any linear genome $G = (0, g'_1, g'_2, \ldots, g'_n, n+1)$, 
where 0 and n + 1 are sentinel genes, we define the function $B_i(G)$, $0 \le i \le n$, by setting 
$B_i(G) = 0$ if genes $g_i$ and $g_{i+1}$ are adjacent, and $B_i(G) = 1$ if not; in other words, 
$B_i(G) = 1$ if and only if G has a breakpoint between $g_i$ and $g_{i+1}$. When G is circular 
there are at most n breakpoints $B_i(G)$, $1 \le i \le n$. We denote the breakpoint distance 
between two genomes G and G' by $BP(G, G')$. Let $P_{i|k} = \Pr(B_i(G_k) = 1)$; then 
$E[BP(G_0, G_k)] = \sum_{i=0}^{n} P_{i|k}$ for linear genomes and $E[BP(G_0, G_k)] = \sum_{i=1}^{n} P_{i|k}$ 
for circular genomes. 

3 The Exact-IEBP Method 

3.1 Derivation of the Exact-IEBP Method 

Signed Circular Genomes. We now assume that all genomes are given in the canonical 
representation. Under the GNT model for circular genomes, $P_{i|k}$ has the same 
distribution for all i, $1 \le i \le n$. Therefore $E[BP(G_0, G_k)] = n P_{1|k}$. Let $\mathcal{G}_n^C$ be the set of 
all signed circular genomes, and let $W_n^C = \{\pm 2, \pm 3, \ldots, \pm n\}$. We define the function 
$K : \mathcal{G}_n^C \to W_n^C$ as follows: for any genome $G \in \mathcal{G}_n^C$, $K(G) = x$ if $g_2$ is at position 
$|x|$ with the same sign as x. For example, in the genome $G = (g_1, g_3, g_5, g_4, -g_2)$ we 
have $K(G) = -5$. Since the sign and the position of gene $g_2$ uniquely determine $P_{1|k}$, 
$\{K(G_k) : k \ge 0\}$ is a homogeneous Markov chain where the state space is $W_n^C$. We 
will use these states for indexing elements in the transition matrix and the distribution 
vectors. For example, if M is the transition matrix for $\{K(G_k) : k \ge 0\}$, then $M_{i,j}$ is 
the probability of jumping to state i from state j in one step in the Markov chain for all 
$i, j \in W_n^C$. 

For every rearrangement $\rho \in R_I \cup R_T \cup R_V$, we construct the matrix $Y_\rho$ as follows: 
for every $i, j \in W_n^C$, $(Y_\rho)_{ij} = 1$ if $\rho$ changes the state of gene $g_2$ from j to i, and 0 
otherwise. We then let $M_I = \frac{1}{|R_I|} \sum_{\rho \in R_I} Y_\rho$, $M_T = \frac{1}{|R_T|} \sum_{\rho \in R_T} Y_\rho$, and 
$M_V = \frac{1}{|R_V|} \sum_{\rho \in R_V} Y_\rho$. The transition matrix M for $\{K(G_k) : k \ge 0\}$ is therefore 
$M = (1 - \alpha - \beta) M_I + \alpha M_T + \beta M_V$. Let $x_k$ be the distribution vector for $K(G_k)$; we have 

$(x_0)_2 = 1$ 

$(x_0)_i = 0, \quad i \in W_n^C,\ i \ne 2$ 

$x_k = M^k x_0$ 

$E[BP(G_0, G_k)] = n P_{1|k} = n(1 - (x_k)_2)$ 

The earlier inversion-only result is the special case where $\alpha = \beta = 0$. 

Signed Linear Genomes. When the genomes are linear, we no longer have the luxury 
of placing gene $g_1$ at some fixed position with positive sign; different breakpoints may 
have different distributions. We need to solve the distribution of each breakpoint 
individually by considering the positions and the signs of both genes involved at the same 
time. Let $\mathcal{G}_n^L$ be the set of all signed linear genomes, and let $W_n^L = \{(u, v) : u, v = 
\pm 1, \ldots, \pm n,\ |u| \ne |v|\}$. We define the functions $L_i : \mathcal{G}_n^L \to W_n^L$, $i = 1, \ldots, n-1$, 
as follows: for any genome $G \in \mathcal{G}_n^L$, $L_i(G) = (x, y)$ if $g_i$ is at position $|x|$ having 
the same sign as x, and $g_{i+1}$ is at position $|y|$ having the same sign as y. Therefore 
$\{L_i(G_k) : k \ge 0\}$, $1 \le i \le n-1$, are n − 1 homogeneous Markov chains where the 
state space is $W_n^L$. For example, in the genome $G = (g_1, g_2, g_4, g_5, g_6, -g_3, -g_7, g_8)$ 
we have $L_3(G) = (-6, 3)$ and $L_7(G) = (-7, 8)$. As before we use the states in 
$W_n^L$ as indices to the transition matrix and the distribution vectors. Let $x_{i,k}$ be the 
distribution vector of $L_i(G_k)$. For every rearrangement $\rho$ in $R_I$, $R_T$, and $R_V$, $Y_\rho$ is 
defined similarly as before (for circular genomes), except the dimension of the matrix 
is different. We then let $M_I = \frac{1}{|R_I|} \sum_{\rho \in R_I} Y_\rho$, $M_T = \frac{1}{|R_T|} \sum_{\rho \in R_T} Y_\rho$, and 
$M_V = \frac{1}{|R_V|} \sum_{\rho \in R_V} Y_\rho$. The transition matrix M has the same form as that for the 
circular genomes: $M = (1 - \alpha - \beta) M_I + \alpha M_T + \beta M_V$. Let e be the vector where 
$e_{(u,v)} = 1$ if $v = u + 1$, and 0 otherwise (that is, $e_s = 1$ if s is the state where the two 
genes are adjacent so there is no breakpoint between them). Therefore 

$(x_{i,0})_{(i,i+1)} = 1$ 

$(x_{i,0})_{(u,v)} = 0, \quad (u, v) \in W_n^L,\ (u, v) \ne (i, i+1)$ 

$x_{i,k} = M^k x_{i,0}$ 

$P_{i|k} = 1 - e^T x_{i,k} = 1 - e^T M^k x_{i,0}$ 

Since the two sentinel genes 0 and n + 1 never change their positions and signs, their 
states are fixed. This means the distributions of the two breakpoints $B_0$ and $B_n$ depend 
on the state of one gene each ($g_1$ and $g_n$, respectively); we can use the method for 
circular genomes. Under the GNT model they have the same distribution. Then the 
expected breakpoint distance after k events is 

$E[BP(G_0, G_k)] = \sum_{i=0}^{n} P_{i|k} = 2P_{0|k} + \sum_{i=1}^{n-1} P_{i|k} = 2P_{0|k} + \sum_{i=1}^{n-1} (1 - e^T M^k x_{i,0})$ 

$\qquad = 2P_{0|k} + (n-1) - e^T M^k \sum_{i=1}^{n-1} x_{i,0}$ 

We now define the Exact-IEBP estimator $\hat{k}(G, G')$ for the true evolutionary distance 
between two genomes G and G': 

1. For all k = 1, ..., r (where r is some integer large enough to bring a genome to 
random) compute $E[BP(G_0, G_k)]$ using the results above. 

2. To compute $k' = \hat{k}(G, G')$ ($0 \le k' \le r$), we 

(a) compute the breakpoint distance $b = BP(G, G')$, then 

(b) find the integer $k'$, $0 \le k' \le r$, such that $|E[BP(G_0, G_{k'})] - b|$ is minimized. 
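
Step 2 amounts to a nearest-value lookup in the precomputed curve of expected breakpoint distances. A minimal sketch, in which `exact_iebp_lookup` is a hypothetical name and the curve is assumed to have been computed elsewhere and stored as a list indexed by k:

```python
def exact_iebp_lookup(expected):
    """Return a function mapping a breakpoint distance b to the estimate k'.

    expected[k] is assumed to hold E[BP(G_0, G_k)] for k = 0..r.
    """
    def estimate(b):
        # Smallest k in 0..r whose expected breakpoint distance is closest to b.
        return min(range(len(expected)), key=lambda k: abs(expected[k] - b))
    return estimate


# Purely illustrative curve (made-up values, not taken from the paper).
toy_curve = [0.0, 2.0, 3.9, 5.6, 7.1, 8.4]
estimate = exact_iebp_lookup(toy_curve)
```

Since the breakpoint distance is an integer between 0 and n, the function can also be tabulated once for every possible b and reused for all pairs of genomes.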



3.2 The Transition Matrices for Signed Circular Genomes 

We now derive closed-form formulas of the transition matrix M for the GNT model 
on signed circular genomes with n genes. Let $\binom{a}{b}$ denote the binomial coefficient; in 
addition, we let $\binom{a}{b} = 0$ if $b > a$. First consider the number of rearrangement events in 
each class: 

1. Inversions. By symmetry of the circular genomes and the model, each inversion has 
a corresponding inversion that inverts the complementary subsequence (the solid 
vs. the dotted arc in Figure 1(a)); thus we only need to consider the $\binom{n}{2}$ inversions 
that do not invert gene $g_1$. 

2. Transpositions. In Figure 1(b), given the three indices in a transposition, the genome 
is divided into three subsequences, and the transposition swaps two subsequences 
without changing the signs. Let the three subsequences be A, B, and C, where A 
contains gene $g_1$. A takes the form $(A_1, g_1, A_2)$, where $A_1$ and $A_2$ may be empty. 
In the canonical representation there are only two possible unsigned permutations: 
$(g_1, A_2, B, C, A_1)$ and $(g_1, A_2, C, B, A_1)$. This means we only need to consider 
transpositions that swap the two subsequences not containing $g_1$. 

3. Inverted Transpositions. There are $3\binom{n}{3}$ inverted transpositions. In Figure 1(c), 
given the three endpoints in an inverted transposition, exactly one of the three 
subsequences changes signs. Using the canonical representation, we interchange 
the two subsequences that do not contain $g_1$ and invert one of them (the first two 
genomes right of the arrow in Figure 1(c)), or we invert both subsequences without 
swapping (the rightmost genome in Figure 1(c)). 

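For all $u, v \in W_n^C$, let $\iota_n(u, v)$, $\tau_n(u, v)$ and $\nu_n(u, v)$ be the numbers of inversions, 
transpositions, and inverted transpositions that bring a gene in state u to state v (n is the 
number of genes in each genome). Then 

$M_{u,v} = (1 - \alpha - \beta)(M_I)_{u,v} + \alpha (M_T)_{u,v} + \beta (M_V)_{u,v}$ 

$\qquad = \frac{1 - \alpha - \beta}{\binom{n}{2}}\, \iota_n(u, v) + \frac{\alpha}{\binom{n}{3}}\, \tau_n(u, v) + \frac{\beta}{3\binom{n}{3}}\, \nu_n(u, v)$ 

The following lemma gives formulas for $\iota_n(u, v)$, $\tau_n(u, v)$, and $\nu_n(u, v)$. 

Lemma 1. 

(a) $\iota_n(u, v) = \min\{|u| - 1,\ |v| - 1,\ n + 1 - |u|,\ n + 1 - |v|\}$ if $uv < 0$; 
    $\iota_n(u, v) = 0$ if $u \ne v$, $uv > 0$; 
    $\iota_n(u, v) = \binom{|u| - 1}{2} + \binom{n + 1 - |u|}{2}$ if $u = v$. 

(b) $\tau_n(u, v) = 0$ if $uv < 0$; 
    $\tau_n(u, v) = (\min\{|u|, |v|\} - 1)(n + 1 - \max\{|u|, |v|\})$ if $u \ne v$, $uv > 0$; 
    $\tau_n(u, v) = \binom{|u| - 1}{3} + \binom{n + 1 - |u|}{3}$ if $u = v$. 

(c) $\nu_n(u, v) = (n - 2)\, \iota_n(u, v)$ if $uv < 0$; 
    $\nu_n(u, v) = \tau_n(u, v)$ if $u \ne v$, $uv > 0$; 
    $\nu_n(u, v) = 3\, \tau_n(u, v)$ if $u = v$. 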



Proof. The proof of (a) is omitted; this result was first shown in earlier work. We now prove (b). 
Consider the gene with state u. Let v be the new state of that gene after the transposition 
with indices (a, b, c), $2 \le a < b < c \le n + 1$. Since transpositions do not change the 
sign, $\tau_n(u, v) = \tau_n(-u, -v)$, and $\tau_n(u, v) = 0$ if $uv < 0$. Therefore we only need to 
analyze the case where $u, v > 0$. 

We first analyze the case when $u = v$. Assume that either $a \le u < b$ or $b \le u < c$. 
In the first case, from the definition in Section 1 we immediately have $v = u + (c - b)$, 
therefore $v - u = c - b > 0$. In the second case, we have $v = u + (a - b)$, therefore 
$v - u = a - b < 0$. Both cases contradict the assumption that $u = v$, and the only 
remaining possibilities that make $u = v$ are when $2 \le u = v < a$ or $c \le u = v \le n$. 
This leads to the third line in the $\tau_n(u, v)$ formula. Next, the total number of solutions 
(a, b, c) for the following two problems is $\tau_n(u, v)$ when $u \ne v$ and $u, v > 0$: 

(i) $u < v$: $b = c - (v - u)$, $2 \le a \le u < b < c \le n + 1$, $u < v < c$. 

(ii) $u > v$: $b = a + (u - v)$, $2 \le a < b \le u < c \le n + 1$, $a \le v < u$. 

In the first case $\tau_n(u, v) = (u - 1)(n + 1 - v)$, and in the second case $\tau_n(u, v) = 
(v - 1)(n + 1 - u)$. The second line in the $\tau_n(u, v)$ formula follows by combining the 
two results. 

For inverted transpositions there are three distinct subclasses of rearrangement 
events. The result in (c) follows by applying the above method to the three cases. 

Fig. 1. The three types of rearrangement events in the GNT model on a signed circular 
genome: (a) inversion, (b) transposition, (c) inverted transposition. (a) We only need to 
consider inversions that do not invert $g_1$. (b) A transposition corresponds to swapping 
two subsequences. (c) The three types of inverted transpositions. Starting from the left 
genome, the three distinct results are shown here; the broken arc represents the 
subsequence being transposed and inverted. 
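
The closed forms of Lemma 1 make the expected breakpoint distance on circular genomes easy to tabulate. The sketch below is an illustration under stated assumptions, not the author's implementation: the function names are hypothetical, the values for u = v are taken to be $\binom{|u|-1}{2} + \binom{n+1-|u|}{2}$ for inversions and $\binom{|u|-1}{3} + \binom{n+1-|u|}{3}$ for transpositions (the counts that follow from the argument in the proof), $n \ge 3$ is assumed, and plain Python lists are used instead of a linear algebra library.

```python
from math import comb


def iota(n, u, v):
    """Inversions not touching g_1 that move a gene from state u to state v."""
    if u * v < 0:
        return min(abs(u) - 1, abs(v) - 1, n + 1 - abs(u), n + 1 - abs(v))
    if u != v:
        return 0
    return comb(abs(u) - 1, 2) + comb(n + 1 - abs(u), 2)


def tau(n, u, v):
    """Transpositions that move a gene from state u to state v."""
    if u * v < 0:
        return 0
    if u != v:
        return (min(abs(u), abs(v)) - 1) * (n + 1 - max(abs(u), abs(v)))
    return comb(abs(u) - 1, 3) + comb(n + 1 - abs(u), 3)


def nu(n, u, v):
    """Inverted transpositions that move a gene from state u to state v."""
    if u * v < 0:
        return (n - 2) * iota(n, u, v)
    return tau(n, u, v) if u != v else 3 * tau(n, u, v)


def transition_matrix(n, alpha, beta):
    """States of W_n^C (state 2 first) and the matrix M; the counts are symmetric."""
    states = [s for pos in range(2, n + 1) for s in (pos, -pos)]
    w_inv = (1 - alpha - beta) / comb(n, 2)
    w_tr = alpha / comb(n, 3)
    w_itr = beta / (3 * comb(n, 3))
    m = [[w_inv * iota(n, u, v) + w_tr * tau(n, u, v) + w_itr * nu(n, u, v)
          for v in states] for u in states]
    return states, m


def expected_breakpoints(n, alpha, beta, r):
    """List of E[BP(G_0, G_k)] = n (1 - (x_k)_2) for k = 0..r."""
    states, m = transition_matrix(n, alpha, beta)
    x = [1.0 if s == 2 else 0.0 for s in states]     # gene g_2 starts in state 2
    curve = []
    for _ in range(r + 1):
        curve.append(n * (1 - x[0]))                 # states[0] is state 2
        x = [sum(row[b] * x[b] for b in range(len(states))) for row in m]
    return curve
```

As a consistency check, every row of the matrix sums to 1, and a single event from the identity yields an expected breakpoint distance of 2 for an inversion and 3 for a transposition or an inverted transposition.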

3.3 Running Time Analysis 

Let m be the number of genomes and the dimension of the distance matrix. Since for 
every pair of genomes we can compute the breakpoint distance between them in linear 
time, computing the breakpoint distance matrix takes $O(m^2 n)$ time. Consider the value 
r, the number of inversions needed to produce a genome that is close to random; we 
can use this as an upper bound of k. In earlier work we showed by simulation and the IEBP 
formula that it is reasonable to set $r = \gamma n$ for some constant $\gamma$ larger than 1. We used 
$\gamma = 2.5$ for 120 genes in our experiment. 

Constructing the transition matrix M for circular genomes takes $O(n^2)$ time by 
Lemma 1. We believe results similar to Lemma 1 can be obtained for linear genomes, 
though it is still an open problem. Instead, we use the construction in Section 3.1 for 
linear genomes. For each rearrangement $\rho$, constructing the $Y_\rho$ matrix takes $O(n^2)$ time. 
Since there are $O(n^2)$ inversions and $O(n^3)$ transpositions and inverted transpositions, 
constructing the transition matrix M takes $O(n^5)$ time. The running time for computing 
$x_k$ in Exact-IEBP for k = 1, ..., r is $O(rn^2) = O(n^3)$ for circular genomes and 
$O(rn^4) = O(n^5)$ for linear genomes by r matrix-vector multiplications. Since the 
breakpoint distance is always an integer between 0 and n, we can construct the array 
$\hat{k}(b)$ that converts the breakpoint distance b to the corresponding Exact-IEBP distance in 
$O(n^2)$ time. Transforming the breakpoint distance matrix into the Exact-IEBP distance 
matrix takes $O(m^2)$ additional array lookups. 

We summarize the discussion as follows: 

Theorem 1. Given a set of m genomes on n genes, we can estimate the pairwise true 
evolutionary distance in 

1. $O(m^2 n + n^3)$ time using Exact-IEBP when the genomes are circular, 

2. $O(m^2 n + n^5)$ time using Exact-IEBP when the genomes are linear, and 

3. $O(m^2 n + \min\{n, m^2\} \log n)$ time using IEBP. 

4 Experiments 

We now show the experimental study of different distance estimators. We compare 
the following five distance estimators on circular genomes: (1) BP, the breakpoint 
distance between two genomes, (2) INV, the minimum number of inversions needed to 
transform one genome into another, (3) IEBP, an approximation to the Exact-IEBP 
method with fast running time, (4) EDE, an estimation of the true evolutionary 
distance based on the INV distance, and (5) Exact-IEBP, our new method. 

Software. We use PAUP* 4.0 to compute the neighbor joining method and the false
negative rates between two trees (which will be defined later). We have implemented
a simulator for the GNT model. The input is a rooted leaf-labeled tree and the
associated parameters (i.e. edge lengths, and the relative probabilities of inversions,
transpositions, and inverted transpositions). On each edge, the simulator applies
random rearrangement events to the circular genome at the ancestral node according to the
model with given parameters α and β. We use tgen to generate random trees. These
trees have topologies drawn from the uniform distribution, and edge lengths drawn from
the discrete uniform distribution on intervals [a, b], where we specify a and b.
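
The simulator is not listed in the paper. The following is a minimal Python sketch of how the event step of such a simulator might look, assuming a genome is stored as a list of signed integers and that events within each class are drawn uniformly. The function names (`invert`, `transpose`, `random_event`, `evolve`) are hypothetical, and segments are not allowed to wrap around the end of the list, which is a simplification for circular genomes.

```python
import random


def invert(genome, i, j):
    """Reverse genome[i:j] and flip the sign of every gene in it."""
    return genome[:i] + [-g for g in reversed(genome[i:j])] + genome[j:]


def transpose(genome, i, j, k, inverted=False):
    """Move genome[i:j] behind genome[j:k] (i < j < k), optionally inverting it."""
    segment = genome[i:j]
    if inverted:
        segment = [-g for g in reversed(segment)]
    return genome[:i] + genome[j:k] + segment + genome[k:]


def random_event(genome, alpha, beta, rng):
    """Apply one event: an inversion with probability 1 - alpha - beta,
    a transposition with probability alpha, and an inverted transposition
    with probability beta."""
    n = len(genome)
    u = rng.random()
    if u < 1 - alpha - beta:
        i, j = sorted(rng.sample(range(n + 1), 2))
        return invert(genome, i, j)
    i, j, k = sorted(rng.sample(range(n + 1), 3))
    return transpose(genome, i, j, k, inverted=(u >= 1 - beta))


def evolve(genome, k, alpha, beta, seed=None):
    """Apply k random events to a copy of the genome (one tree edge of length k)."""
    rng = random.Random(seed)
    for _ in range(k):
        genome = random_event(genome, alpha, beta, rng)
    return genome
```

Calling `evolve(list(range(1, 121)), 10, 1/3, 1/3)` would mimic an edge of length 10 on a 120-gene genome when the three event classes are equally likely.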

4.1 Accuracy of the Estimators 

In this section we study the behavior of the Exact-IEBP distance by comparing it to
the actual number of rearrangement events. We simulate the GNT model on a circular
genome with 120 genes, the typical number of genes in plant chloroplast genomes.
Starting with the unrearranged genome G_0, we apply k events to it to obtain the
genome G_k, where k = 1, ..., 300. For each value of k we simulate 500 runs. We then
compute the five distances.

[Figure 2: five panels, (a) Breakpoint Distance, (b) Inversion Distance, (c) EDE Distance, (d) IEBP Distance, (e) Exact-IEBP Distance.]

Fig. 2. Accuracy of the Estimators (see Section 4.1). The number of genes is 120.
Each plot is a comparison between a distance measure and the actual number of
rearrangements. We show the result for the inversion-only evolutionary model only.
The x-axis is divided into 30 bins; the length of the vertical bars indicates the standard
deviation. The distance estimators are (a) BP, (b) INV, (c) EDE, (d) IEBP, and (e) Exact-IEBP.
Panels (a), (b), and (d) and panel (c) are reproduced from two earlier publications.

The simulation results under the inversion-only model are shown in Figure 2. Under
the other two model settings, the simulation results show similar behavior (e.g. shape
of curves and standard deviations). Note that both BP and INV distances underestimate
the actual number of events, and EDE slightly overestimates the actual number of
events when the number of events is high. The IEBP and Exact-IEBP distances are both
unbiased (the means of the computed distances are equal to the actual number of
rearrangement events) and have similar standard deviations. We then compare different
distance estimators by the absolute difference between the measured distances and the actual
number of events. Using the same data as in the previous experiment, we generate the plots
as follows. The x-axis is the actual number of events. For each distance estimator D we
plot the curve f_D, where f_D(x) is the mean of the set {|(1/c) D(G_0, G_k) − k| : 1 ≤ k ≤ x}
over all observations G_k.*

[Figure 3: three panels, (a) Inversions only, (b) Transpositions only, (c) Three types of events equally likely. The x-axis of each panel is the actual number of events.]

Fig. 3. Accuracy of the estimators by absolute difference (see Section 4.1 for the
details). We simulate the evolution on 120 genes. The curves of BP, INV, IEBP, and EDE
were published previously; they are included for comparative purposes.

The result is in Figure 3. The relative performance is the same for most cases: BP
is the worst, followed by INV, IEBP, and EDE. Exact-IEBP has the best performance
except for inversion-only scenarios, where EDE is slightly better only when the number
of events is small. In most cases, IEBP behaves similarly to Exact-IEBP when
the amount of evolution is small; the IEBP and Exact-IEBP curves are almost
indistinguishable in (a). Yet, in (b) and (c) the IEBP curve is inferior to the Exact-IEBP curve
by a large margin when the number of events is above about 200.



4.2 Accuracy of Neighbor Joining Using Different Estimators 

In this section we explore the accuracy of the neighbor joining tree under different ways 
of calculating genomic distances. See Table 1 for the settings for the experiment.

Given an inferred tree, we compare its “topological accuracy” by computing “false
negatives” with respect to the “true tree”. We begin by defining the true tree.
During the evolutionary process, some edges of the model tree may have no changes 
(i.e. evolutionary events) on them. Since reconstructing such edges is at best guesswork, 
we are not interested in these edges. Hence, we define the true tree to be the result of 
contracting those edges in the model tree on which there are no changes. 

We now define how we score an inferred tree, by comparison to the true tree. For 
every tree there is a natural association between every edge and the bipartition on the 
leaf set induced by deleting the edge from the tree. Let T be the true tree and let T' 

* The constant c is to reduce the bias effect in different distances. For the IEBP and the
Exact-IEBP distances c = 1 since they estimate the actual number of events. For the BP distance we
let c = 2(1 − α − β) + 3(α + β) = 2 + α + β since this is the expected number of breakpoints
created by each event in the model when the number of events is very low. Similarly for the
INV and EDE distances we let c = (1 − α − β) + 3α + 2β = 1 + 2α + β since each
transposition can be replaced by 3 inversions, and each inverted transposition can be replaced
by 2 inversions.

Table 1. Settings for the Neighbor Joining Performance Simulation Study.

Parameter                               Value
1. Number of genes                      120
2. Number of leaves                     10, 20, 40, 80, and 160
3. Expected number of rearrangements    Discrete Uniform within the following intervals:
   in each edge                         [1,3], [1,5], [1,10], [3,5], [3,10], and [5,10]
4. Probability settings: (α, β)†        (0, 0) (Inversion only)
                                        (1, 0) (Transposition only)
                                        (1/3, 1/3) (The three rearrangement classes are equally likely)
5. Datasets for each setting            100

† The probabilities that a rearrangement is an inversion, a transposition, or an inverted
transposition are 1 − α − β, α, and β, respectively.



be the inferred tree. An edge e in T is “missing” in T' if T' does not contain an edge 
defining the same bipartition; such an edge is called a false negative. Note that the 
external edges (i.e. edges incident to a leaf) are trivial in the sense that they are present 
in every tree with the same set of leaves. The false negative rate is the number of false 
negative edges in T' with respect to T divided by the number of internal edges in T. 
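
As an illustration of the scoring just defined, here is a small Python sketch that computes the false negative rate from the bipartitions of two trees. It assumes trees are given as nested tuples of leaf labels (a hypothetical encoding; the experiments in the paper use PAUP* for this step), and the function names are illustrative only.

```python
def leaves(tree):
    """Return the set of leaf labels of a nested-tuple tree."""
    if not isinstance(tree, tuple):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def bipartitions(tree):
    """Return the nontrivial bipartitions induced by the internal edges.

    Each bipartition is stored as the side that does not contain a fixed
    reference leaf, so the same split always has the same representation.
    """
    all_leaves = leaves(tree)
    ref = min(all_leaves)
    result = set()

    def visit(node):
        if not isinstance(node, tuple):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        side = all_leaves - clade if ref in clade else clade
        # Keep only splits with at least two leaves on each side.
        if 1 < len(side) < len(all_leaves) - 1:
            result.add(side)
        return clade

    visit(tree)
    return result


def false_negative_rate(true_tree, inferred_tree):
    """Fraction of internal edges of the true tree missing from the inferred tree."""
    true_splits = bipartitions(true_tree)
    if not true_splits:
        return 0.0
    missing = true_splits - bipartitions(inferred_tree)
    return len(missing) / len(true_splits)
```

For example, with true tree `((("A", "B"), "C"), ("D", "E"))` and inferred tree `((("A", "C"), "B"), ("D", "E"))`, one of the two internal edges of the true tree is missing, so the rate is 0.5.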

For each setting of the parameters (number of leaves, probabilities of rearrange- 
ments, and edge lengths), we generate 100 datasets of genomes as follows. First, we 
generate a random leaf-labeled tree (from the uniform distribution on topologies). The 
leaf-labeled tree and the parameter settings thus define a model tree in the GNT model. 
We run the simulator on the model tree, and produce a set of genomes at the leaves. 

For each set of genomes, we compute the five distances. We then compute NJ trees
on each of the five distance matrices, and compare the resultant trees to the true tree.
The results of this experiment are in Figure 4. The x-axis is the maximum normalized
inversion distance (as computed by the linear-time algorithm for minimum inversion
distances) between any two genomes in the input. Distance matrices with
some normalized edit distances close to 1 are said to be “saturated”, and the recovery of
accurate trees from such datasets is considered to be very difficult. The y-axis is the
false negative rate (i.e. the proportion of missing edges). False negative rates of less than
5% are excellent, but false negative rates of up to 10% can be tolerated. We use NJ(D)
to denote the tree returned by NJ using distance D. Note that except for NJ(EDE), the
relative order of the NJ trees' accuracy under the different distances is consistent with the order
of the accuracy of the distances in the absolute difference plot. NJ(BP) has the worst
accuracy in all settings. NJ(INV) outperforms NJ(Exact-IEBP) and NJ(IEBP) by a small
margin only when the amount of evolution is low; in the transposition-only scenario
the accuracy of NJ(INV) degrades considerably. NJ(Exact-IEBP) has slightly better
accuracy than NJ(IEBP) until the amount of evolution is high; after that the accuracy
of NJ(IEBP) degrades, and NJ(Exact-IEBP) outperforms NJ(IEBP) by a larger margin.
Despite the inferior accuracy of EDE in the experiments in Section 4.1, NJ using EDE
returns the most accurate tree on average (especially in the inversion-only model), but
the accuracy of the NJ tree using Exact-IEBP is comparable in most cases.

[Figure 4: three panels, (a) Inversions only, (b) Transpositions only, (c) Three types of events equally likely.]

Fig. 4. Neighbor Joining Performance under Several Distances (see Section 4.2). See
Table 1 for the settings in the experiment. For comparative purposes, we include the curves
of NJ(BP), NJ(INV), and NJ(IEBP), and the curve of NJ(EDE), from previously published work.



4.3 Robustness to Unknown Model Parameters 

In this section we demonstrate the robustness of the Exact-IEBP estimator when the
model parameters are unknown. The settings are the same as in Table 1. The experiment is
similar to the previous experiment, except here we use both the correct and the incorrect
values of (α, β) for the Exact-IEBP distance. The results are in Figure 5. These results
suggest that NJ(Exact-IEBP) is robust against errors in (α, β).



5 Conclusions 

We have introduced Exact-IEBP, a new technique for estimating true evolutionary
distances between whole genomes. This technique can be applied to signed circular and
linear genomes with arbitrary relative probabilities between the three types of events
in the GNT model. Our simulation study shows that the Exact-IEBP method improves
upon the previous technique, IEBP, both in the distance estimation and in the
accuracy of the inferred tree when used in neighbor joining. The accuracy of the NJ trees
using the new method is comparable with that of the best estimator so far, the EDE estimator.
These different methods are simple yet powerful and can be generalized easily to
different models.



^ We do not have a good explanation for the superior accuracy of NJ(EDE) due to the fact that 
the behavior of NJ is still not well understood. 



Estimating Evolutionary Distances between Whole Genomes 







[Figure 5: three panels, (a) Inversions only, (b) Transpositions only, (c) Three types of events equally likely.]

Fig. 5. Robustness of the Exact-IEBP Method to Unknown Parameters (see Section 4.3).
See Table 1 for the settings in the experiment. The two values in the legend are the
α and β values used in the Exact-IEBP method. The probabilities that a rearrangement event
is an inversion, a transposition, or an inverted transposition are 1 − α − β, α, and β,
respectively.

Abstract. We derive a branch-and-bound algorithm to find an optimal inversion
median of three signed permutations. The algorithm prunes to manageable size 
an extremely large search tree using simple geometric properties of the problem 
and a newly available linear-time routine for inversion distance. Our experiments 
on simulated data sets indicate that the algorithm finds optimal medians in rea- 
sonable time for genomes of medium size when distances are not too large, as 
commonly occurs in phylogeny reconstruction. In addition, we have compared 
inversion and breakpoint medians, and found that inversion medians generally 
score significantly better and tend to be far more unique, which should make 
them valuable in median-based tree-building algorithms. 



1 Introduction 



Dobzhansky and Sturtevant [7] first proposed using the degree to which gene orders
differ between species as an indicator of evolutionary distance that could be useful for
phylogenetic inference, and Watterson et al. first proposed the minimum number
of chromosomal inversions necessary to transform one ordering into another as an
appropriate distance metric. The 1992 study by Sankoff et al. [21] included a heuristic
algorithm for finding rearrangement distance (which considered transpositions,
insertions, and deletions, as well as inversions); it was the first large-scale application and
experimental validation of rearrangement-based techniques for phylogenetic purposes
and initiated what is now nearly a decade of intense interest in computational problems
relating to genome rearrangement (see summaries in [16,19,22]).

While much of the attention given to rearrangement problems may be due to their
intriguing combinatorial properties, rearrangement-based approaches to phylogenetic
inference are of genuine biological interest in cases in which sequence-based approaches
perform poorly, such as when species diverged early or are rapidly evolving. In
addition, rearrangement-based phylogenetic methods can suggest probable gene orderings
of ancestral species, while other methods cannot. Furthermore, mathematical
models of genome rearrangement have applications beyond phylogeny.

Recent work on rearrangement distance and sorting by inversions (or reversals,
as they are often called in computer science) has produced a duality theorem and
polynomial-time algorithm for inversion distance between two signed permutations,
a duality theorem and polynomial-time algorithm for distance in terms of equally
weighted translocations and inversions for signed permutations, polynomial-time
algorithms for sorting by reversals, and a linear-time algorithm for computing
inversion distances. Note that “signed permutations” correspond to genomes for which
the direction of transcription of each gene is known as well as the ordering of the genes.

Much recent work on rearrangement-based phylogeny stems from
an algorithm by Sankoff and Blanchette that iterates over a prospective tree,
repeatedly finds medians of the three permutations adjacent to each internal vertex, and
uses them to improve the tree until convergence occurs. This method finds locally
optimal trees and simultaneously allows an estimation of the configurations of ancestral
genomes. These studies have generally used breakpoint distance as the basis for finding
medians, because it is more easily computable than inversion distance, it assumes no
particular mechanism of rearrangement, and the problem of finding a breakpoint
median has a straightforward reduction to the well known Travelling Salesman Problem
(TSP) [17]. The number of breakpoints between two genomes is the number of genes
that are adjacent in one but not the other genome; the breakpoint median of a set of
genomes is the ordering of genes that minimizes the sum of the number of breakpoints
with respect to each genome in the set.
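
To make the breakpoint count concrete, the following Python sketch counts adjacencies of one signed genome that are absent from another. It is an illustration under stated assumptions, not code from the paper: genomes are lists of signed integers, an adjacency (a, b) is treated as identical to (−b, −a) because both describe the same gene pair read from opposite strands, and the names `adjacencies` and `breakpoint_distance` are hypothetical.

```python
def adjacencies(genome, circular=True):
    """Return the set of adjacencies of a signed genome in canonical form."""
    pairs = list(zip(genome, genome[1:]))
    if circular and len(genome) > 1:
        pairs.append((genome[-1], genome[0]))
    # (a, b) and (-b, -a) are the same adjacency read on opposite strands;
    # keep the smaller tuple so both readings map to one representative.
    return {min((a, b), (-b, -a)) for a, b in pairs}


def breakpoint_distance(g1, g2, circular=True):
    """Number of adjacencies present in g1 but absent from g2."""
    return len(adjacencies(g1, circular) - adjacencies(g2, circular))
```

A single inversion of the segment (2, 3) in the circular genome (1, 2, 3, 4, 5) gives (1, −3, −2, 4, 5); the adjacency (2, 3) survives as (−3, −2), and only the two adjacencies at the ends of the segment are broken.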

Breakpoint distance is related to inversion distance (an inversion can remove at most
two breakpoints) but the relationship is a loose one. Because it is believed that
inversions are the primary mechanism of genome rearrangement for many taxa, we
seek a solution to the median problem based directly on inversion distance. Finding an
inversion median is known to be NP-hard, and to date, no one has reported a
reasonably efficient algorithm (approximate or exact) for this problem. (In one
study, however, inversion medians were obtained for a particular data set using a bounded
exhaustive search.)

In this paper, we present a simple yet effective branch-and-bound algorithm to solve 
the median of three problem exactly. Our approach does not depend on properties spe- 
cific to inversions, but can be used with any rapidly computable metric. We have evalu- 
ated its effectiveness for the case of inversion medians, and found that it obtains optimal 
medians with reasonable computational effort for a range of parameters that include 
most realistic instances encountered in phylogenetic analysis. In addition, we have per- 
formed a comparison of inversion and breakpoint medians, and found that inversion 
medians score significantly better in terms of total induced edge length, and tend to be 
far more unique. These findings suggest that inversion medians, when used within the 
algorithm of Sankoff and Blanchette, will allow better trees to be computed in fewer 
iterations. 



2 Notation and Definitions 

We consider the case where all genomes have identical sets of n genes and
inversion is the single mechanism of rearrangement. We represent each genome $G_i$ as a
permutation of size n, and we let all pairs of genomes $G_i = (g_{i,1} \ldots g_{i,n})$ and
$G_j = (g_{j,1} \ldots g_{j,n})$, in a set of genomes $\mathcal{G}$, be represented by
$\pi_i = (\pi_{i,1} \ldots \pi_{i,n})$ and $\pi_j = (\pi_{j,1} \ldots \pi_{j,n})$ such that
$\pi_{i,k} = \pi_{j,l}$ iff $g_{i,k} = g_{j,l}$, and $\pi_{i,k} = -1 \cdot \pi_{j,l}$ iff $g_{i,k}$
is the reverse complement of $g_{j,l}$.

We define an inversion acting on permutation $\pi$ from i to j, for $i \le j$, as that
operation which transforms $\pi$ into
$\varphi = (\pi_1, \pi_2, \ldots, \pi_{i-1}, -\pi_j, -\pi_{j-1}, \ldots, -\pi_i, \pi_{j+1}, \ldots, \pi_n)$.
The minimal number of inversions required to change one permutation $\pi_i$ into
another permutation $\pi_j$ is the inversion distance, which we denote by $d(\pi_i, \pi_j)$
(sometimes abbreviated as $d_{i,j}$).
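
The inversion operation can be written in a few lines. This Python sketch follows the definition above with 1-based, inclusive positions i ≤ j; the tuple representation and the name `apply_inversion` are assumptions made for illustration.

```python
def apply_inversion(pi, i, j):
    """Invert positions i..j (1-based, inclusive) of a signed permutation.

    The segment is reversed and every element in it changes sign.
    """
    if not 1 <= i <= j <= len(pi):
        raise ValueError("positions must satisfy 1 <= i <= j <= n")
    segment = tuple(-x for x in reversed(pi[i - 1:j]))
    return pi[:i - 1] + segment + pi[j:]
```

Applying the same inversion twice restores the original permutation, and with i = j a single gene simply changes its sign.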

Let the inversion median M of a set of N permutations $\Pi = \{\pi_1, \pi_2, \ldots, \pi_N\}$ be
the signed permutation that minimizes the sum $S(M, \Pi) = \sum_{i=1}^{N} d(M, \pi_i)$. Let this
sum $S(M, \Pi) = S(\Pi)$ be called the median score of M with respect to $\Pi$.

For a given number of genes n, we can construct an undirected graph $G_n = (V, E)$
such that each vertex in V corresponds to a signed permutation of size n and two
vertices are connected by an edge if and only if one of the corresponding permutations
can be obtained from the other through a single inversion; formally,
$E = \{\{v_i, v_j\} \mid v_i, v_j \in V \text{ and } d(\pi_i, \pi_j) = 1\}$. We will call $G_n$ the
inversion graph of size n. In
this graph, the distance between any two vertices, $v_i$ and $v_j$, is the same as the
inversion distance between the corresponding permutations, $\pi_i$ and $\pi_j$. Furthermore,
finding the median of a set of permutations $\Pi$ is equivalent to finding the minimum
unweighted Steiner tree of the corresponding vertices in $G_n$. Note that $G_n$ is very large
($|V| = n! \cdot 2^n$), so this representation does not immediately suggest a feasible
graph-search algorithm, even for small n.
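
For very small n the inversion graph can still be explored exhaustively, which gives a useful reference point for the definitions above. The Python sketch below is such a brute-force baseline, not the algorithm of this paper: it computes inversion distances by breadth-first search over all n!·2^n vertices and picks a vertex of minimum median score. Linear (non-circular) signed permutations stored as tuples are assumed, and all function names are hypothetical.

```python
from collections import deque
from itertools import combinations_with_replacement


def neighbors(pi):
    """Yield every permutation reachable from pi by a single inversion."""
    n = len(pi)
    for i, j in combinations_with_replacement(range(n), 2):  # 0-based, i <= j
        yield pi[:i] + tuple(-x for x in reversed(pi[i:j + 1])) + pi[j + 1:]


def distances_from(source):
    """Breadth-first search: inversion distance from source to every vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in neighbors(v):
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist


def brute_force_median(perms):
    """Return (median, score) by scanning the whole inversion graph."""
    tables = [distances_from(p) for p in perms]

    def score(v):
        return sum(table[v] for table in tables)

    best = min(tables[0], key=score)
    return best, score(best)
```

With n = 3 the graph has only 3!·2^3 = 48 vertices, but the count grows so quickly that this approach is only practical as a correctness check on toy inputs.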

Definition 1. A shortest path between two permutations of size n, $\pi_1$ and $\pi_2$, is a
connected subgraph of the inversion graph $G_n$ containing only the vertices $v_1$ and $v_2$
corresponding to $\pi_1$ and $\pi_2$, and the vertices and edges on a single shortest path between
$v_1$ and $v_2$.

Definition 2. A median path of a set of permutations $\Pi$ each of size n is a connected
subgraph of the inversion graph $G_n$ containing only the vertices corresponding to
permutations in $\Pi$, the vertex corresponding to a median M of $\Pi$, and a shortest path
between M and each $\pi \in \Pi$.

Definition 3. A trivial median of a set of permutations $\Pi$ is a median M that is a
member of that set, $M \in \Pi$.

Definition 4. A trivial median path of a set of permutations $\Pi$ is a median path that
includes only the elements of $\Pi$ and shortest paths between elements of $\Pi$.



3 General Median Bounds 

Because phylogenetic reconstruction algorithms generally work with binary trees in
which each internal node has three neighbors, the special case of the median of three
genomes is of particular interest. In this section we develop a general bound for the
median-of-three problem, one that relies only on the metric property of the distance
measure used.



Lemma 1. The median score $S(\Pi)$ of a set of equally sized permutations $\Pi = \{\pi_1,
\pi_2, \pi_3\}$, separated by pairwise distances $d_{1,2}$, $d_{1,3}$, and $d_{2,3}$, obeys these bounds:

$$\frac{d_{1,2} + d_{1,3} + d_{2,3}}{2} \le S(\Pi) \le \min\{(d_{1,2} + d_{2,3}), (d_{1,2} + d_{1,3}), (d_{2,3} + d_{1,3})\}$$

Proof. The upper bound follows directly from the possibility of a trivial median, and
the lower bound from properties of metric spaces (a median of lower score would
necessarily violate the triangle inequality with respect to two of $\pi_1$, $\pi_2$, and $\pi_3$; see Figure 1).

Fig. 1. Let vertices $v_1$, $v_2$, and $v_3$ correspond to permutations $\pi_1$, $\pi_2$, and $\pi_3$, and let
vertex $v_M$ correspond to a median M.
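
The bounds of Lemma 1 are cheap to evaluate once the three pairwise distances are known. The Python sketch below is illustrative (the name `median_bounds` is hypothetical). It rounds the lower bound up to the next integer, which is valid here only because inversion distances, and hence median scores, are integers.

```python
def median_bounds(d12, d13, d23):
    """Return (lower, upper) bounds on the median-of-three score.

    lower: half the sum of the pairwise distances, rounded up because
           scores are integers.
    upper: the score of the best trivial median (one of the three inputs).
    """
    lower = (d12 + d13 + d23 + 1) // 2
    upper = min(d12 + d23, d12 + d13, d23 + d13)
    return lower, upper
```

For pairwise distances 2, 2, and 2 the bounds are 3 and 4, so any vertex found with score 3 is certainly a median.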



Lemma 2. If three permutations $\pi_1$, $\pi_2$, and $\pi_3$ have a median M that is part of a
trivial median path, then M must be a trivial median.

Proof. Assume to the contrary that $\pi_1$, $\pi_2$, and $\pi_3$ have a trivial median path and have a
median M that is not trivial. By Definition 4, M must be on a shortest path between two
of $\pi_1$, $\pi_2$, and $\pi_3$. Without loss of generality, assume that the median path runs from $\pi_1$
to M to $\pi_2$ to $\pi_3$. Let $d_{1,2}$, $d_{1,3}$, and $d_{2,3}$ be the pairwise distances between $\{\pi_1, \pi_2\}$,
$\{\pi_1, \pi_3\}$, and $\{\pi_2, \pi_3\}$, respectively, and let $d_{M,2} > 0$ be the distance of M from $\pi_2$.
Then the median score of M is $(d_{1,2} - d_{M,2}) + d_{M,2} + (d_{M,2} + d_{2,3}) = d_{1,2} + d_{M,2} + d_{2,3}$.
But this score is greater by $d_{M,2}$ than the score of a trivial median at $\pi_2$, so M cannot
be a median.



Theorem 1. Let $\pi_1$, $\pi_2$, and $\pi_3$ be permutations such that $\pi_2$ and $\pi_3$ are separated
by distance $d_{2,3}$, and let $\varphi$ be another permutation separated from $\pi_1$, $\pi_2$, and $\pi_3$ by
distances $d_{1,\varphi}$, $d_{2,\varphi}$, and $d_{3,\varphi}$, respectively. Suppose that $\varphi$ is on a median path $P_M$ of
$\pi_1$, $\pi_2$, and $\pi_3$ such that $\varphi$ is on a shortest path between $\pi_1$ and a median M. Then the
score $S(M)$ of M obeys these bounds:

$$d_{1,\varphi} + \frac{d_{2,\varphi} + d_{3,\varphi} + d_{2,3}}{2} \le S(M) \le d_{1,\varphi} + \min\{(d_{2,\varphi} + d_{2,3}), (d_{2,\varphi} + d_{3,\varphi}), (d_{2,3} + d_{3,\varphi})\}$$



Proof. Let $v_1$, $v_2$, and $v_3$ be vertices corresponding to $\pi_1$, $\pi_2$, and $\pi_3$, in the inversion
graph of the appropriate size. In addition, let there be a vertex $v_\varphi$ corresponding to $\varphi$, as
illustrated in Figure 2. We claim that a median path $P_M$ including $v_\varphi$ and M, such that
$v_\varphi$ is on a shortest path from $v_1$ to M, can be constructed by combining a shortest path
between $v_1$ and $v_\varphi$, and a median path of $v_\varphi$, $v_2$, and $v_3$. Assume to the contrary that
there exists a shorter median path $P_{short}$, which also includes $v_\varphi$ and M, but does not
include the shortest path between $v_1$ and $v_\varphi$, or does not include a median path of $v_\varphi$, $v_2$,
and $v_3$. $P_{short}$ has to include $v_1$ via a vertex other than $v_\varphi$ and consequently other than
M (because $v_\varphi$ is on a shortest path between $v_1$ and M). By Definition 2, $P_{short}$ must
consist only of $v_1$, $v_2$, $v_3$, M, and vertices between them (including $v_\varphi$), so $v_1$ must be
connected to $P_{short}$ via $v_2$ or $v_3$. Consequently, M must be on a shortest path between
$v_2$ and $v_3$; otherwise including M in $P_{short}$ would result in a score greater than that
of a trivial median. Therefore, M is part of a trivial median path, which means that by
Lemma 2, M is a trivial median. In particular, M must be the vertex $v_i \in \{v_2, v_3\}$ to
which $v_1$ is connected. Furthermore, our assumptions about $\varphi$ require that $v_\varphi$ be on the
shortest path between $v_1$ and $v_i$. Then $P_{short}$ includes both the shortest path between
$v_1$ and $v_\varphi$ and the median path of $v_\varphi$, $v_2$, and $v_3$, and we obtain the desired contradiction.

Fig. 2. A median path including $v_\varphi$ can be constructed using a shortest path from $v_1$ to
$v_\varphi$ and any median path of $v_\varphi$, $v_2$, and $v_3$.

Because $P_M$ can be constructed by combining a shortest path between $v_1$ and $v_\varphi$,
and a median path of $v_\varphi$, $v_2$, and $v_3$, its score is equivalent to the sum of the distance
between $v_1$ and $v_\varphi$ ($d_{1,\varphi}$), and the score of the median of $v_\varphi$, $v_2$, and $v_3$. By applying
Lemma 1 to the latter, we obtain the desired bound.

4 The Algorithm



Algorithm find_inversion_median is presented below. It is essentially a
branch-and-bound search for an optimal inversion median that uses Theorem 1 to prune a search
tree based on the inversion graph and to prioritize among search branches.

Prioritization is managed using a priority stack, which always returns an item of
highest priority, but returns items of equal priority in last-in-first-out order. Because
the range of possible priorities is small, we use a fixed array of priority values, each
pointing to a stack, and so can execute push and pop operations in fast constant time.
Using stacks rather than the more conventional queues in this application is not required
for correctness, but, by inducing depth-first searching among alternatives of equal cost,
it rapidly produces a good upper bound for the search.
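
A priority stack of this kind can be sketched in Python as an array of ordinary stacks indexed by priority. This is an illustration rather than the authors' C implementation; it assumes that a lower best-possible score means higher priority and that the priority range is known in advance, and the class and method names are hypothetical.

```python
class PriorityStack:
    """Array of stacks, one per priority value in [lowest, highest]."""

    def __init__(self, lowest, highest):
        self.lowest = lowest
        self.stacks = [[] for _ in range(highest - lowest + 1)]
        self.size = 0

    def push(self, item, priority):
        """Place item on the stack that belongs to its priority value."""
        self.stacks[priority - self.lowest].append(item)
        self.size += 1

    def pop(self):
        """Return the most recently pushed item among those of best priority."""
        for stack in self.stacks:  # scan from the lowest (best) score upward
            if stack:
                self.size -= 1
                return stack.pop()
        raise IndexError("pop from empty priority stack")

    def __len__(self):
        return self.size
```

The scan in `pop` touches at most one slot per priority value, so its cost is bounded by the (small) width of the priority range rather than by the number of stored items.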

The algorithm begins by establishing upper and lower bounds for the solution using
Lemma 1 (steps 1 and 2) and priming the priority stack with a best-scoring vertex (steps
3 and 4). Then it enters a main loop (step 6) in which it repeatedly pops the “most
promising” vertex from the priority stack, finds all of its as-yet-unvisited neighbors
(step 8), and evaluates each one for feasibility. Neighbors are obtained by generating
every permutation that can be produced from a vertex by a single inversion.
Neighbors of a vertex v can be ignored if they are not farther from the origin than
is v (step 9); such vertices will be examined as neighbors of another vertex if they can
feasibly belong to a median path. The best possible score (i.e., lower bound) for a vertex
w is used as the basis for prioritization. Best and worst possible scores are calculated
using the bounds of Theorem 1 (steps 10 and 11) and maintained for all vertices present
in the priority stack. Vertices can be pruned when their best possible scores are not lower than the
current global upper bound. The global upper bound can be lowered when a vertex is
found that has a lesser upper bound (step 13). The search ends when no vertex in the
priority stack has a best-possible score lower than the upper bound (step 7) or a score equal to
the global lower bound is found (step 12).

The algorithm will return a permutation M only if M is a true median of the inputs
$\pi_1$, $\pi_2$, and $\pi_3$. Assume to the contrary that a permutation M' returned by the algorithm
is not a true median. Because the algorithm returns the permutation having the lowest
median score of all of the permutations (vertices) it visits (steps 12 and 13), it must not
have visited some median. If the algorithm did not visit some median, then either it
pruned all paths to medians or it exited before reaching any median.

Suppose the algorithm pruned all paths to medians. It prunes a vertex only when
its best possible score is not lower than the current global upper bound, $M_{max}$. Note
that the global upper bound always corresponds to the actual median score of a vertex
that has been visited (steps 2 and 13), so it cannot be wrong. Consider a median M
with at least one median path $P_M$. By Definition 2, $P_M$ must include at least one path
between M and each of the vertices $v_1$, $v_2$, and $v_3$ corresponding to $\pi_1$, $\pi_2$, and $\pi_3$.
The algorithm proceeds by examining neighbors of an origin $\psi_{orig} \in \{\pi_1, \pi_2, \pi_3\}$.
Therefore, if the algorithm pruned all paths to M, then it must have pruned a vertex on
the path between $\psi_{orig}$ and M. But the best scores of such vertices are calculated using
the lower bound of Theorem 1 (step 10), which we have shown to be correct. Therefore,
the algorithm cannot have pruned the shortest paths to medians.



Finding an Optimal Inversion Median: Experimental Results



Algorithm 1: find_inversion_median

Input: Three signed permutations of size n: π_1, π_2, and π_3. Assume a function
distance(π_i, π_j) that returns the inversion distance between π_i and π_j in linear
time.

Output: An optimal inversion median M.

begin
    d_{1,2} ← distance(π_1, π_2);
    d_{1,3} ← distance(π_1, π_3);
    d_{2,3} ← distance(π_2, π_3);
 1  M_min ← ⌈(d_{1,2} + d_{1,3} + d_{2,3}) / 2⌉;
 2  M_max ← min{(d_{1,2} + d_{2,3}), (d_{1,2} + d_{1,3}), (d_{2,3} + d_{1,3})};
    Initialize priority stack s for range M_min to M_max;
    (ψ_orig, ψ_1, ψ_2) ← (π_i, π_j, π_k) such that {π_i, π_j, π_k} = {π_1, π_2, π_3} and
        d_{i,j} + d_{i,k} = M_max;
 3  create vertex v with v_label ← ψ_orig, v_dist ← 0, v_best ← M_min, v_worst ← M_max;
 4  push(s, v);
 5  M ← ψ_orig;
    d_sep ← d_{j,k};
    stop ← false;
 6  while s is not empty and stop = false do
        pop(s, v);
 7      if v_best ≥ M_max then stop ← true;
        else
 8          foreach {w | w is an unmarked neighbor of v} do
                w_dist ← distance(w_label, ψ_orig);
 9              if w_dist ≤ v_dist then continue;
                mark w;
                d_{w,1} ← distance(w_label, ψ_1);
                d_{w,2} ← distance(w_label, ψ_2);
10              w_best ← w_dist + ⌈(d_{w,1} + d_{w,2} + d_sep) / 2⌉;
11              w_worst ← w_dist + min{(d_{w,1} + d_sep), (d_{w,2} + d_sep), (d_{w,1} + d_{w,2})};
12              if w_worst = M_min then M ← w_label; stop ← true;
                else
                    if w_best < M_max then push(s, w, w_best);
13                  if w_worst < M_max then
                        M ← w_label; M_max ← w_worst;
end

Suppose instead that the algorithm exited before reaching a median. The algorithm
can exit for one of three reasons:

1. The priority stack s becomes empty (step 6);

2. The next item returned from s has a best possible score greater than or equal to the
current global upper bound (step 7);

3. A vertex w is found with a worst possible score equal to the global lower bound
(step 12).

Case 1 can occur only if all vertices have been visited, or if all remaining neighbors
have been pruned (because except when the algorithm stops for another reason, each
new neighbor is either pruned or pushed onto s). If all vertices have been visited, then
a median must have been visited. We have shown above that all neighbors on paths to
a median cannot have been pruned. Because s always returns a vertex v such that no
other vertex in s has a lower best-possible score than v, and because all neighbors that
are not pruned are added to s, case 2 can only occur if a median has been visited or if
all paths to medians have been pruned. We have shown that all paths to medians cannot
have been pruned. Therefore, if case 2 occurs, a median must have been visited. In case
3, w must be a median, since the global lower bound is set directly according to Lemma
1 (step 1), which we have shown to be correct.

Thus, none of these three cases can arise before a median has been found, and
the algorithm must return a median. The worst-case running time of the algorithm is
$O(n^{3d})$, with $d = \min\{d_{1,2}, d_{2,3}, d_{1,3}\}$, but as would be expected with a
branch-and-bound algorithm, the average running time appears to be much better.

5 Experimental Method 

We implemented f ind_inversion_tnedian in C, reusing the linear-time distance 
routine (as well as some auxiliary code) from GRAPPA im , and we evaluated its perfor- 
mance on simulated data. All test data was generated by a simple program that creates 
multiple sets of three permutations by applying random inversions to the identity per- 
mutation, such that each set of three permutations represents three taxa derived from a 
common ancestor under an inversions-only model of evolution. In addition to the num- 
ber of genes n to model and the number of sets s to create, this program accepts a 
parameter i that determines how many random inversions to apply in obtaining the per- 
mutation for each taxon. Thus, if n = 100, i = 10, and s = 10, the program generates 
10 sets of 3 signed permutations, each of size 100, and obtains each permutation by
applying 10 random inversions to the permutation +1, +2, ..., +100. A random inversion
is defined as an inversion between two random positions i and j such that 1 ≤ i, j ≤ n
(if i = j, a single gene simply changes its sign). When the number of inversions i is
small compared to n, the permutations in a set tend to lie at a distance of about 2i
from one another.
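
The data generator described above can be sketched as follows. This is a minimal
illustration in Python, not the authors' C program: the function names, the 0-based
positions, and the seeding scheme are assumptions made for the example.

```python
import random


def apply_inversion(perm, i, j):
    """Invert positions i..j (0-based, inclusive) of a signed permutation.

    The segment is reversed and every gene in it changes sign; when i == j a
    single gene simply flips its sign.
    """
    lo, hi = min(i, j), max(i, j)
    segment = [-gene for gene in reversed(perm[lo:hi + 1])]
    return perm[:lo] + segment + perm[hi + 1:]


def random_taxon(n, num_inversions, rng):
    """Apply `num_inversions` random inversions to the identity +1, ..., +n."""
    perm = list(range(1, n + 1))
    for _ in range(num_inversions):
        perm = apply_inversion(perm, rng.randrange(n), rng.randrange(n))
    return perm


def generate_sets(n, num_inversions, num_sets, seed=0):
    """Create `num_sets` sets of three taxa derived from a common ancestor."""
    rng = random.Random(seed)
    return [[random_taxon(n, num_inversions, rng) for _ in range(3)]
            for _ in range(num_sets)]


if __name__ == "__main__":
    # The example from the text: n = 100, i = 10, s = 10.
    sets = generate_sets(n=100, num_inversions=10, num_sets=10)
    print(len(sets), len(sets[0]), len(sets[0][0]))
```

Each taxon is derived independently from the identity, so the identity plays the role
of the common ancestor of every set.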

We used several algorithmic engineering techniques to improve the efficiency of
find_inversion_median. For example, we avoided dynamic memory allocation
and reused records representing graph vertices. We were able to gain a significant
speedup by optimizing the hash table used for marking vertices: a custom hash table
offered a fourfold increase in the overall speed of the program, as compared with UNIX’s
db implementation. With circular genomes, we achieved a further improvement in
performance by hashing on the circular identity of each permutation rather than on the
permutation itself. We define the circular identity of a permutation as that equivalent
permutation that begins with the gene labeled +1. By hashing on circular identities,
we reduced the number of vertices to visit and the number of permutations to mark by
approximately a factor of 2n.
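
As an illustration of hashing on circular identities, the sketch below canonicalizes a
signed circular permutation before it is used as a key. It assumes that two
permutations are equivalent when one is a rotation of the other or a rotation of its
reversed, sign-flipped copy (the opposite reading direction), which is what gives 2n
equivalents per permutation. The names are hypothetical, and a Python set stands in
for the custom hash table.

```python
def circular_identity(perm):
    """Return the equivalent permutation that begins with gene +1."""
    if 1 in perm:
        oriented = list(perm)
    else:
        # Gene 1 appears as -1: read the circle in the opposite direction.
        oriented = [-gene for gene in reversed(perm)]
    start = oriented.index(1)
    return tuple(oriented[start:] + oriented[:start])


def equivalents(perm):
    """All 2n rotations and reversed, sign-flipped rotations of `perm`."""
    n = len(perm)
    flipped = [-gene for gene in reversed(perm)]
    return [tuple(p[k:] + p[:k]) for p in (list(perm), flipped) for k in range(n)]


# Marking any one of the 2n equivalents marks them all.
visited = set()
for p in equivalents([3, -2, 1, 4]):
    visited.add(circular_identity(p))
```

After the loop, `visited` holds a single key, so the marking table stores one entry
where a table keyed on raw permutations would store 2n.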

To improve performance further, we adapted our sequential implementation to run 
in parallel on shared-memory architectures. Two steps in the algorithm are readily paral- 
lelizable: the major loop (step0), during each iteration of which a new vertex is popped 
from the priority stack, and the minor loop (step 0), in which the neighbors of a vertex
v are generated, examined for marks, and evaluated for feasibility as medians. We
enabled parallel processing at both levels, using pthreads for maximum portability 
across shared-memory architectures. With careful use of semaphores and pthreads 
mutex functions, we were able to reduce the cost of synchronization among threads to 
an acceptable level. 

6 Experimental Results 

6.1 Performance of Bounds 

Being especially concerned with the effectiveness of the pruning strategy, we have cho- 
sen as a measure of performance the number of vertices V of the inversion graph that 
the algorithm visited. In particular, we have taken V to be the number of times the 
program executed the loop at step |Hlof the algorithm. Note that the number of calls to 
distance is approximately 3V. We recorded the distribution of V over many experiments,
in which we used various values for the number of genes n and the number of
inversions per tree edge i. Figure 3 is typical of our results. It summarizes 500
experiments with n = 50 and i = 7 and shows a roughly exponential distribution, with high
relative frequencies in a few intervals having small V: in 87% of the experiments, fewer
than 10,000 vertices were visited, and in 95%, fewer than 20,000 were visited. This
figure demonstrates that the algorithm generally finds a median rapidly, but occasionally
becomes mired in an unprofitable region of the search space. We have observed that
the tail of the exponential distribution becomes more substantial as i grows larger with
respect to n.

Fig. 3. Distribution of the number of vertices visited in the course of 500 experiments
with n = 50 and i = 7.

Fig. 4. Statistical median of the number of vertices visited V for i = 5 and 10 ≤ n ≤
100, over 50 experiments for each value of n.

In order to characterize typical performance, we recorded the statistical medians
of V as n and i varied independently. The results are shown in Figures 4 and 5. For
comparison, we have also plotted the mean values of V. Note that, at least for i = 5,
the median and mean of V appear to grow quadratically over a significant range of
values for n; a simple fit yields f(n) = 2.1n^2 for the median values. Note also that,
for n = 50, the median of V grows approximately linearly with i, at least as long as i
remains small (mean V grows somewhat faster than median V). To put the observed rate
of growth into perspective, note that in the theoretical worst-case of because 

d ^ 2i and V = 0(^^) = one would see (given i = 5 and n = 50) 

growth of V with and 50®*“^. 

6.2 Running Time and Parallel Speedup 

We have tested program find_inversion_median sequentially on a 700 MHz Intel
Pentium III with 128MB of memory, and using various levels of parallelism on a Sun
E10000 with 64 UltraSPARC processors running at 333 MHz and 64GB of memory. Figure 6
shows average running times for i = 5 and n between 50 and 125. Sequential running
times are shown for the Sun and Intel processors and parallel running times for the Sun
with the number of processors p ∈ {1, 2, 4, 6}. In all cases, the average time to find
a median is about 12 seconds or less. Observe that for n = 100 (a realistic size for
chloroplast or mitochondrial genomes) medians can generally be found in an average
of about 2 seconds using a reasonably fast computer. We should note that the memory
requirements for the program are considerable, and that the level of performance shown
here is partly a consequence of the large amount of RAM available on the Sun.

Fig. 5. Statistical median of the number of vertices visited V for n = 50 and 1 ≤ i ≤ 8,
plotted with mean of V. The number of experiments for each value of i is 50.

Fig. 6. Sequential and parallel running times for i = 5 and n ∈ {50, 75, 100, 125}. Each
data point represents an average taken over 10 experiments. Parallel configurations used
parallelism only in the minor loop of the algorithm.

It is evident from Figure 6 that we achieve a good parallel speedup for small p, but
that the benefits of parallelization begin to erode between p = 4 and p = 6 (this
tendency becomes more pronounced at p = 8, which we have not plotted here for clarity of
presentation). Anecdotal evidence suggests that the cause of this trend is a combination
of the overhead of synchronization and uneven load balancing among the computing
threads. We also observed that parallelism in the minor loop of the algorithm was far
more effective than parallelism in the major loop, presumably because the heuristic for
prioritization is sufficiently effective that the latter strategy results in a large amount of
unnecessary work.



6.3 Inversion Medians vs. Breakpoint Medians 

Using program find_inversion_median, we evaluated the significance of inversion
medians by comparing them with breakpoint medians, trivial medians, and “actual”
medians (i.e., the ancestral permutations from which observed taxa actually arose;
in this case, always equal to the identity permutation). Figure 7, which shows results
over 1 ≤ i ≤ 5 for n = 25, is typical of what we observed. It demonstrates that
inversion medians achieve comparable scores to actual medians¹ and that breakpoint
medians, when scored in terms of inversion distance, perform significantly worse. A
comparison in terms of inversion median scores is clearly biased in favor of inversion
medians; however, if it is true that inversion distances are (in at least some cases) more
meaningful than breakpoint distances, then these results suggest that inversion medians
are worth obtaining.

Fig. 7. Comparison of inversion medians with breakpoint medians, trivial medians, and
actual medians, for n = 25. Averages were taken over 50 experiments.

We used a slight modification of program find_inversion_median to find
all optimal medians and thus to characterize the extent to which inversion medians are
unique. An example of our results is shown in Figure 8, which describes the number
of optimal inversion medians for n = 15 and 1 ≤ i ≤ 5, over 50 experiments for
each value of i. Observe that, when i is small compared to n (roughly i < 0.15n), the
inversion median is virtually always unique; and even when i is moderately large with
respect to n (roughly 0.15n < i < 0.3n),² the inversion median is unique or nearly
unique most of the time. This finding stands in stark contrast with breakpoint medians,
which are only very rarely unique.

Fig. 8. Distribution of number of optimal medians in the course of 50 experiments for
n = 15 and 1 ≤ i ≤ 5.

¹ Inversion medians are slightly better than actual medians when i becomes large with
respect to n, because saturation begins to cause convergence between taxa.

In addition, we observed a strong relationship between unique inversion medians 
and actual medians. For example, with n = 15 and i = 1, for which all inversion 
medians were unique, 49 out of 50 inversion medians were identical to actual medians; 
similarly, for n = 15 and i = 2, 48 out of 50 were identical to actual medians (in 
both cases the exceptional inversion medians differed from actual medians by a single 
inversion). As i becomes greater compared to n, this relationship weakens but remains 
significant. For example, with n = 15 and i = 4, 38 out of 50 inversion medians 
were unique, and 22 of those 38 were identical to actual medians (an additional 10 
non-unique inversion medians equaled actual medians). 

² Recall that the distance between permutations is approximately 2i and that random
permutations tend to be separated by a distance of approximately n. The effects of
saturation are evident at i = 0.2n and are pronounced at i = 0.3n.

7 Future Work

The strength and weakness of the current algorithm both lie in its generality. On the one
hand, our approach depends only on elementary properties of metric spaces and thus
extends easily to the case of equally weighted inversions, translocations, fissions, and
fusions; furthermore, it could also be used with weighted rearrangement distances. (One
should note, however, that the running time is a direct function of the cost of evaluating
distances; we can compute exact breakpoint and inversion distances, but no efficient
algorithm is yet known for more complex distance computations.) On the other hand,
our approach does not exploit the unique structure of the inversion problem; as shown
elsewhere in this volume by A. Caprara, restricting the algorithm to inversion distances
only and using aspects of the Hannenhalli-Pevzner theory enables the derivation of
tighter bounds and thus also the solution of larger instances of the inversion median
problem.

Many simple changes to our current implementation will considerably reduce the 
running time. For example, the current implementation does not “condense” genomes 
before processing them — i.e., it does not convert subsequences of genes shared among 
all three genomes to single “supergenes”. Preliminary experiments indicate that
condensing genomes yields very significant improvements in performance when i is small
relative to n. Distance computations themselves, while already fast, can be further
improved by reusing previous computations, since a move by the algorithm makes only
minimal changes to the candidate permutation. Finally, we can use the Kaplan-Shamir- 
Tarjan algorithm, in combination with metric properties, to prepare better initial solu- 
tions (by walking halfway through shortest paths between chosen permutations), thus 
considerably decreasing the search space to be explored.

Abstract. We consider the problem of finding the maximum likelihood rooted
tree under a molecular clock (MLmc), with three species and 2-state characters 
under a symmetric model of substitution. For identically distributed rates per site 
this is probably the simplest phylogenetic estimation problem, and it is readily 
solved numerically. Analytic solutions, on the other hand, were obtained only 
recently (Yang, 2000). 

In this work we provide analytic solutions for any distribution of rates across sites 
(provided the moment generating function of the distribution is strictly increas- 
ing over the negative real numbers). This class of distributions includes, among 
others, identical rates across sites, as well as the Gamma, the uniform, and the 
inverse Gaussian distributions. Therefore, our work generalizes Yang’s solution. 
In addition, our derivation of the analytic solution is substantially simpler. We 
employ the Hadamard conjugation (Hendy and Penny, 1993) and convexity of an 
entropy-like function. 



1 Introduction 

Maximum likelihood (Felsenstein, 1981) is increasingly used as an optimality criterion 
for selecting evolutionary trees, but finding the global optimum is difficult computa- 
tionally, even on a single tree. Because no general analytical solution is available, it is 
necessary to use numeric techniques, such as hill climbing or expectation maximiza- 
tion (EM), in order to find optimal values. Two recent developments are relevant when 
considering analytical solutions for simple substitution models with a small number of 
taxa. Yang (2000) has reported an analytical solution for three taxa with two state char- 
acters under a molecular clock. Thus in this special case the tree and the edge lengths 
that yield maximum likelihood values can now be expressed analytically, allowing the 
most likely tree to be positively identified. Yang calls this case the “simplest phylogeny
estimation problem”.

A second development is in Chor et al. (2000), who used the Hadamard conjugation
for unrooted trees on four taxa, again with two state characters. As part of that study
analytic solutions were found for some families of observed data. It was reported that
multiple optima on a single tree occurred more frequently with maximum likelihood 
than has been expected. In one case, the best tree had a local (non global) optimum that 
was less likely than the optimum value on a different, inferior tree. In such a case, a hill 
climbing heuristic could misidentify the “optimal” tree. Such examples reinforce the 
desirability of analytical solutions that guarantee to find the global optima for any tree. 

Even though the three-taxon, two-state-character model under a molecular clock is
the “simplest phylogeny estimation problem”, it is still potentially an important case
to solve analytically. It can allow a “rooted triplet” method for inferring larger rooted 
trees by building them up from the triplets. This would be analogous to the use of un- 
rooted quartets for building up unrooted trees. Trees from quartets methods are already 
used extensively in various studies (Bandelt and Dress 1986, Strimmer and von Hae- 
seler 1996, Wilson 1998, Ben-Dor et al. 1998, Erdos et al. 1999). The fact that general 
analytical solutions are not yet available for unrooted quartets only emphasizes the im- 
portance of analytical solutions to the rooted triplets case. 

In this work we provide analytic solutions for three taxon MLmc trees under any
distribution of variable rates across sites (provided the moment generating function 
of the distribution is strictly increasing over the negative real numbers). This class of 
distributions includes, as a special case, identical rates across sites. It also includes 
the Gamma, the uniform, and the inverse Gaussian distributions. Therefore, our work 
generalizes Yang’s solution of identical rates across sites. In addition, our derivation 
of the analytic solution is substantially simpler. We employ the Hadamard conjugation 
(Hendy and Penny 1993, Hendy, Penny, and Steel 1994) and convexity of an entropy- 
like function. 

The remainder of this paper is organized as follows: In Section 2 we explain the
Hadamard conjugation and its relation to maximum likelihood. In Section 3 we state
and prove our main technical theorem. Section 4 applies the theorem to solve MLmc
analytically on three species trees. Finally, Section 5 presents some implications of this
work and directions for further research.



2 Hadamard Conjugation and ML 

The Hadamard conjugation (Hendy and Penny 1993, Hendy, Penny, and Steel 1994) is 
an invertible transformation linking the probabilities of site substitutions on edges of 
an evolutionary tree T to the probabilities of obtaining each possible combination of 
characters. It is applicable to a number of simple models of site substitution: Neyman 2 
state model (Neyman 1971), Jukes-Cantor model (Jukes and Cantor 1969), and Kimura 
2ST and 3ST models (Kimura 1983). For these models, the transformation yields a 
powerful tool which greatly simplifies and unifies the analysis of phylogenetic data. In 
this section we explain the Hadamard conjugate and its relationships to ML. 

We now introduce a notation that we will use for labeling the edges of unrooted
binary trees. (For simplicity we use four taxa, but the definitions extend to any $n$.)
Suppose the four species, 1, 2, 3 and 4, are represented by the leaves of the tree $T'$.
A split of the species is any partition of $\{1, 2, 3, 4\}$ into two disjoint subsets. We will
identify each split by the subset which does not contain 4 (in general $n$), so that for
example the split $\{\{1, 2\}, \{3, 4\}\}$ is identified by the subset $\{1, 2\}$. Each edge $e$ of $T$
induces a split of the taxa, namely the two sets of leaves on the two components of $T$
resulting from the deletion of $e$. Hence the central edge of the tree $T' = (12)(34)$ in the
brackets notation induces the split identified by the subset $\{1, 2\}$. For brevity we will
label this edge by $e_{12}$ as a shorthand for $e_{\{1,2\}}$. Thus $E(T') = \{e_1, e_2, e_{12}, e_3, e_{123}\}$
(see Figure 1).




Fig. 1. The Tree T' = (12) (34) and Its Edges. 

We use a similar indexing scheme for splits at a site in the sequences: For a subset
$\alpha \subseteq \{1, \ldots, n-1\}$, we say that a given site $i$ is an $\alpha$-split pattern if $\alpha$ is the set
of sequences whose character state at position $i$ differs from the $i$-th position in the
$n$-th sequence. Given a tree $T$ with $n$ leaves and edge lengths $q = [q_e]_{e \in E(T)}$
($0 \le q_e < \infty$) (where $q_e$ is the expected number of substitutions per site, across the
edge $e$), the expected probability (averaged over all sites) of generating an $\alpha$-split
pattern ($\alpha \subseteq \{1, \ldots, n-1\}$) is well defined (this probability may vary across sites,
depending on the distribution of rates). Denote this expected probability by
$s_\alpha = \Pr(\alpha\text{-split} \mid T, q)$. We define the expected sequence spectrum
$s = [s_\alpha]_{\alpha \subseteq \{1, \ldots, n-1\}}$. Having this spectrum at hand greatly facilitates the
calculation and analysis of the likelihood, since the likelihood of observing a sequence
with splits described by the vector $\hat{s}$ given the sequence spectrum $s$ equals

$$L(\hat{s} \mid s) = \prod_{\alpha \subseteq \{1, \ldots, n-1\}} \Pr(\alpha\text{-split} \mid s)^{\hat{s}_\alpha} = \prod_{\alpha :\, \hat{s}_\alpha > 0} s_\alpha^{\hat{s}_\alpha} .$$

Definition 1. A Hadamard matrix of order $\ell$ is an $\ell \times \ell$ matrix $A$ with $\pm 1$ entries such
that $A^t A = \ell I_\ell$.

We will use a special family of Hadamard matrices, called Sylvester matrices in
MacWilliams and Sloane (1977, p. 45), defined inductively for $n \ge 0$ by $H_0 = [1]$ and

$$H_{n+1} = \begin{bmatrix} H_n & H_n \\ H_n & -H_n \end{bmatrix} .$$

For example,

$$H_1 = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix} , \qquad
H_2 = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & -1 & 1 & -1 \\ 1 & 1 & -1 & -1 \\ 1 & -1 & -1 & 1 \end{bmatrix} .$$

It is convenient to index the rows and columns of $H_n$ by lexicographically ordered
subsets of $\{1, \ldots, n\}$. Denote by $h_{\alpha,\gamma}$ the $(\alpha, \gamma)$ entry of $H_n$, then
$h_{\alpha,\gamma} = (-1)^{|\alpha \cap \gamma|}$.

This implies that $H_n$ is symmetric, namely $H_n^t = H_n$, and thus by the definition of
Hadamard matrices $H_n^{-1} = \frac{1}{2^n} H_n$.
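
The inductive definition and the closed form of the entries can be checked against each
other with a short Python sketch. The bitmask encoding of subsets (element k stored in
bit k-1, so that the indices run over the subsets {}, {1}, {2}, {1,2}, {3}, ...) is an
assumption of this example, chosen because it matches the block structure of the
recursion; the function names are hypothetical.

```python
def sylvester(n):
    """Build H_n as a list of rows via H_{k+1} = [[H_k, H_k], [H_k, -H_k]]."""
    h = [[1]]
    for _ in range(n):
        top = [row + row for row in h]
        bottom = [row + [-x for x in row] for row in h]
        h = top + bottom
    return h


def entry(alpha, gamma):
    """h_{alpha,gamma} = (-1)^{|alpha & gamma|} for subsets encoded as bitmasks."""
    return -1 if bin(alpha & gamma).count("1") % 2 else 1


def is_hadamard(h):
    """Check that H H^t equals (order) times the identity."""
    size = len(h)
    return all(
        sum(h[a][k] * h[g][k] for k in range(size)) == (size if a == g else 0)
        for a in range(size) for g in range(size)
    )
```

Because H_n is symmetric and satisfies the Hadamard property, its inverse is simply
H_n divided by its order 2^n, which is what makes the conjugation cheap to invert.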

The length of an edge $q_e$, $e \in E(T)$ in the tree $T$ was defined as the expected
number of substitutions (changes) per site along that edge. The edge length spectrum of
a tree $T$ with $n$ leaves is the $2^{n-1}$ dimensional vector $q = [q_\alpha]_{\alpha \subseteq \{1, \ldots, n-1\}}$, defined
for any subset $\alpha \subseteq \{1, \ldots, n-1\}$ by

$$q_\alpha = \begin{cases}
q_e & \text{if } e \in E(T) \text{ induces the split } \alpha , \\
-\sum_{e \in E(T)} q_e & \text{if } \alpha = \emptyset , \\
0 & \text{otherwise.}
\end{cases}$$

The Hadamard conjugation specifies a relation between the expected sequence spectrum 
s and the edge lengths spectrum q of the tree. 

Proposition 1. (Hendy and Penny 1993) Let $T$ be a phylogenetic tree on $n$ leaves with
finite edge lengths ($q_e < \infty$ for all $e \in E(T)$). Assume that sites mutate according
to a symmetric substitution model, with equal rates across sites. Let $s$ be the expected
sequence spectrum. Then for $H = H_{n-1}$ we have:

$$s = s(q) = H^{-1} \exp(Hq) ,$$

where the exponentiation function exp is applied element wise to the vector $\rho = Hq$.
That is, for $\alpha \subseteq \{1, \ldots, n-1\}$,

$$s_\alpha = 2^{-(n-1)} \sum_{\gamma} h_{\alpha,\gamma} \exp\Big( \sum_{\delta} h_{\gamma,\delta}\, q_\delta \Big) .$$

This transformation is called the Hadamard conjugation. 
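
As a worked instance, the sketch below evaluates the conjugation for an unrooted tree
on three taxa, with taxon 3 as the reference element. It is an illustration only: the
bitmask indexing of subsets (0 = {}, 1 = {1}, 2 = {2}, 3 = {1,2}) and the function
names are assumptions of the example. The pendant edge of taxon 3 induces the split
identified by {1,2}, so its length is stored at index 3.

```python
import math


def hadamard_entry(alpha, gamma):
    """h_{alpha,gamma} = (-1)^{|alpha & gamma|} for subsets encoded as bitmasks."""
    return -1 if bin(alpha & gamma).count("1") % 2 else 1


def edge_length_spectrum(q1, q2, q3):
    """Spectrum q for three taxa: q_{} is minus the sum of all edge lengths."""
    return [-(q1 + q2 + q3), q1, q2, q3]


def conjugate(q, mgf=math.exp):
    """Return s = H^{-1} M(Hq), applying `mgf` element wise to rho = Hq."""
    size = len(q)
    rho = [sum(hadamard_entry(a, b) * q[b] for b in range(size))
           for a in range(size)]
    m = [mgf(r) for r in rho]
    # H is symmetric with H^{-1} = H / size.
    return [sum(hadamard_entry(a, g) * m[g] for g in range(size)) / size
            for a in range(size)]


s = conjugate(edge_length_spectrum(0.1, 0.1, 0.3))
```

Here `s[0]` is the probability of a constant site, and `s[1]`, `s[2]`, `s[3]` are the
probabilities that taxon 1, 2, or 3 respectively is the odd one out. The entries sum to
one, and `s[1]` equals `s[2]` because the first two edges have equal length. Passing a
different `mgf` gives the variable-rates form of the same transformation.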

For the case of unequal rates across sites, the following generalization applies: 

Proposition 2. (Waddell, Penny, and Moore 1997) Let $T$ be a phylogenetic tree on $n$
leaves with finite edge lengths ($q_e < \infty$ for all $e \in E(T)$). Assume that sites mutate
according to a symmetric substitution model, with unequal rates across sites, and let
$M : \mathbb{R} \to \mathbb{R}$ be the moment generating function of the rate distribution. Let $s$ be the
expected sequence spectrum. Then with $H = H_{n-1}$,

$$s = s(q) = H^{-1}(M(Hq)) ,$$

where the function $M$ is applied element wise to the vector $\rho = Hq$.

This transformation is called the Hadamard conjugation of $M$. Specific examples of
the moment generating function include

- For equal rates across sites, $M(\rho) = e^{\rho}$.

- For the uniform distribution in the interval $[1 - b, 1 + b]$ with parameter $b$ ($1 \ge b > 0$),
  $M(\rho) = \dfrac{e^{(1+b)\rho} - e^{(1-b)\rho}}{2b\rho}$.

- For the $\Gamma$ distribution with parameter $k$ ($k > 0$), $M(\rho) = (1 - \rho/k)^{-k}$.

- For the inverse Gaussian distribution with parameter $d$ ($d > 0$),
  $M(\rho) = e^{d\left(1 - \sqrt{1 - 2\rho/d}\right)}$.

Notice that for $k \to \infty$, the $\Gamma$ distribution converges to the equal rates distribution.

3 Technical Results

Under a molecular clock, a tree on $n$ taxa has at least two sister taxa $i$ and $j$ whose
pendant edges $q_i$ and $q_j$ are of equal length ($q_i = q_j$). Our first result states that if
$q_i = q_j$, then the corresponding split probabilities are equal as well ($s_i = s_j$). Knowing that
a pair of these variables attains the same value simplifies the analysis of the maximum
likelihood tree in general, and in particular makes it possible for the case of $n = 3$ taxa.
Furthermore, if $q_i > q_j$ and the moment generating function $M$ is strictly increasing in
the range $(-\infty, 0]$, then the corresponding split probabilities satisfy $s_i > s_j$ as well.

3.1 Main Technical Theorem 

Theorem 1. Let $i$ and $j$ be sister taxa on a phylogenetic tree $T$ on $n$ leaves, with edge
weights $q$. Let $s$ be the expected sequence spectrum, let $H = H_{n-1}$, and let $M$ be a
real valued function such that

$$s = H^{-1} M(Hq) ,$$

then:

$$q_i = q_j \implies s_i = s_j ,$$

and if the function $M$ is strictly monotonic ascending in the range $\rho \in (-\infty, 0]$ then:

$$q_i > q_j \implies s_i > s_j .$$

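Proof. Let $X = \{1, 2, \ldots, n\}$ be the taxa set with reference element $n$, and let
$X' = X - \{n\}$. Without loss of generality $i, j \ne n$. For $\alpha \subseteq X'$, let $\alpha' = \alpha \Delta \{i, j\}$ (where
$\alpha \Delta \beta = (\alpha \cup \beta) - (\alpha \cap \beta)$ is the symmetric difference of $\alpha$ and $\beta$). The mapping
$\alpha \to \alpha'$ is a bijection between

$$X_i = \{\alpha \subseteq X' \mid i \notin \alpha,\ j \in \alpha\}$$

and

$$X_j = \{\alpha \subseteq X' \mid i \in \alpha,\ j \notin \alpha\} .$$

Note that the two sets $X_i$ and $X_j$ are disjoint. Writing $h_{\alpha,i}$ for $h_{\alpha,\{i\}}$ we have

$$\alpha \in X_i \implies h_{\alpha,i} = 1,\quad h_{\alpha,j} = -1,\quad h_{\alpha',i} = -1,\quad h_{\alpha',j} = 1 .$$

On the other hand, if $\alpha \notin X_i \cup X_j$ then $h_{\alpha,i} = h_{\alpha,j}$. Hence

$$\begin{aligned}
s_i - s_j &= 2^{-(n-1)} \sum_{\alpha \subseteq X'} (h_{\alpha,i} - h_{\alpha,j}) M(\rho_\alpha) \\
&= 2^{-(n-1)} \Big( \sum_{\alpha \in X_i} (h_{\alpha,i} - h_{\alpha,j}) M(\rho_\alpha) + \sum_{\alpha \in X_j} (h_{\alpha,i} - h_{\alpha,j}) M(\rho_\alpha) \Big) \\
&= 2^{-(n-1)} \sum_{\alpha \in X_i} \Big( (h_{\alpha,i} - h_{\alpha,j}) M(\rho_\alpha) + (h_{\alpha',i} - h_{\alpha',j}) M(\rho_{\alpha'}) \Big) \\
&= 2^{-(n-1)} \sum_{\alpha \in X_i} \big( 2M(\rho_\alpha) - 2M(\rho_{\alpha'}) \big) \\
&= 2^{-(n-2)} \sum_{\alpha \in X_i} \big( M(\rho_\alpha) - M(\rho_{\alpha'}) \big) .
\end{aligned}$$

By the definition of the Hadamard conjugate,

$$\rho_\alpha = \sum_{\beta \subseteq X'} h_{\alpha,\beta}\, q_\beta , \qquad
\rho_\alpha - \rho_{\alpha'} = \sum_{\beta \subseteq X'} (h_{\alpha,\beta} - h_{\alpha',\beta})\, q_\beta .$$

Now for $\beta = \emptyset$ we have $h_{\alpha,\beta} = h_{\alpha',\beta} = 1$ so the contribution of $\beta = \emptyset$ to $\rho_\alpha - \rho_{\alpha'}$
is zero. Likewise, for any split $\beta \subseteq X'$ ($\beta \ne \emptyset$), which does not correspond to an
edge $e \in E(T)$, $q_\beta = 0$. So the only contributions to $\rho_\alpha - \rho_{\alpha'}$ may come from splits
$\beta$ corresponding to edges in $T$. Now since $i$ and $j$ are sister taxa in $T$, every edge
$e \in E(T)$ that is not pendant upon $i$ or $j$ does not separate $i$ from $j$. Thus the split $\beta$
corresponding to such edge $e$ satisfies $\beta \notin X_i \cup X_j$. Therefore the parities of $|\alpha \cap \beta|$
and $|\alpha' \cap \beta|$ are the same, so

$$h_{\alpha,\beta} = (-1)^{|\alpha \cap \beta|} = (-1)^{|\alpha' \cap \beta|} = h_{\alpha',\beta} .$$

Thus the only contributions to $\rho_\alpha - \rho_{\alpha'}$ may come from the two edges pendant upon $i$
and $j$, namely

$$\rho_\alpha - \rho_{\alpha'} = (h_{\alpha,i} - h_{\alpha',i})\, q_i + (h_{\alpha,j} - h_{\alpha',j})\, q_j ,$$

and for $\alpha \in X_i$ we get $\rho_\alpha - \rho_{\alpha'} = 2(q_i - q_j)$.

Thus if $q_i = q_j$ then for every $\alpha \in X_i$ we have $\rho_\alpha = \rho_{\alpha'}$, so $M(\rho_\alpha) = M(\rho_{\alpha'})$
and thus $s_i - s_j = 2^{-(n-2)} \sum_{\alpha \in X_i} (M(\rho_\alpha) - M(\rho_{\alpha'})) = 0$, namely $s_i = s_j$. Further,
if $q_i > q_j$ then for every $\alpha \in X_i$ we have $\rho_\alpha > \rho_{\alpha'}$. Now $q_\emptyset = -\sum_{e \in E(T)} q_e$, and for
every $e \in E(T)$, $q_e \ge 0$. Since $\rho_\alpha = \sum_{\beta \subseteq X'} h_{\alpha,\beta}\, q_\beta$ and $h_{\alpha,\emptyset} = 1$ we conclude that
$\rho_\alpha \le 0$ for all $\alpha \subseteq X'$. Therefore, if $M$ is strictly monotone ascending in $(-\infty, 0]$ then
$M(\rho_\alpha) > M(\rho_{\alpha'})$, $\forall \alpha \in X_i$. Since $s_i - s_j = 2^{-(n-2)} \sum_{\alpha \in X_i} (M(\rho_\alpha) - M(\rho_{\alpha'}))$,
we have $s_i > s_j$. □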

We remark that the moment generating function $M$ in each of the four examples of Section 2
(equal rates across sites, uniform distribution with parameter $b$, $0 < b \le 1$, Gamma
distribution with parameter $k$, $0 < k$, and inverse Gaussian distribution with parameter
$d$, $0 < d$) is strictly increasing in the range $\rho \in (-\infty, 0]$.
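
The remark can be spot-checked numerically. The sketch below assumes the standard
moment generating functions of rate distributions with mean 1 (uniform on [1 - b, 1 + b],
Gamma with shape k, inverse Gaussian with shape d); these closed forms and the default
parameter values are assumptions of the example. It tests monotonicity on a finite grid
only, which illustrates the claim but does not prove it. The last function is the n = 3
instance of the expression for s_i - s_j obtained in the proof of Theorem 1, with i = 1,
j = 2 and reference taxon 3.

```python
import math


def mgf_equal(rho):
    return math.exp(rho)


def mgf_uniform(rho, b=0.5):
    """Rates uniform on [1 - b, 1 + b], 0 < b <= 1; M(0) = 1 by continuity."""
    if rho == 0:
        return 1.0
    return (math.exp((1 + b) * rho) - math.exp((1 - b) * rho)) / (2 * b * rho)


def mgf_gamma(rho, k=2.0):
    return (1 - rho / k) ** (-k)


def mgf_inverse_gaussian(rho, d=2.0):
    return math.exp(d * (1 - math.sqrt(1 - 2 * rho / d)))


def is_increasing_on_grid(mgf, lo=-20.0, steps=400):
    """Check M(x) < M(y) for consecutive grid points x < y in [lo, 0]."""
    grid = [lo * (1 - t / steps) for t in range(steps + 1)]
    values = [mgf(r) for r in grid]
    return all(x < y for x, y in zip(values, values[1:]))


def split_difference(q1, q2, q3, mgf):
    """s_1 - s_2 for three taxa: (M(rho_{2}) - M(rho_{1})) / 2."""
    return 0.5 * (mgf(-2 * (q2 + q3)) - mgf(-2 * (q1 + q3)))
```

With any of the four functions, `split_difference` is zero when `q1 == q2` and positive
when `q1 > q2`, as Theorem 1 states.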

4 Three Taxa MLmc Trees

We first note that for three taxa, the problem of finding analytically the ML trees without 
the constraint of a molecular clock is trivial. This is a special case of unconstrained 
likelihood for the multinomial distribution. On the other hand, adding a molecular clock 
makes the problem interesting even for n = 3 taxa, which is the case we treat in this 
section. 

For $n = 3$, let $s_0$ be the probability of observing the constant site pattern (xxx or
yyy). Let $s_1$ be the probability of observing the site pattern which splits 1 from 2 and
3 (xyy or yxx). Similarly, let $s_2$ be the probability of observing the site pattern which
splits 2 from 1 and 3 (yxy or xyx), and let $s_3$ be the probability of observing the site
pattern which splits 3 from 1 and 2 (xxy or yyx).

Consider unrooted trees on the taxa set $X = \{1, 2, 3\}$ that have two edges of the
same length. Let $\mathcal{T}_1$ denote the family of such trees with edges 2 and 3 of the same
length ($q_2 = q_3$), $\mathcal{T}_2$ denote the family of such trees with edges 1 and 3 of the same
length ($q_1 = q_3$), and $\mathcal{T}_3$ denote the family of such trees with edges 2 and 1 of the same
length ($q_2 = q_1$). Finally, let $\mathcal{T}_0$ denote the family of trees with $q_1 = q_2 = q_3$. We first
see how to determine the ML tree for each family.




Fig. 2. Three Trees in the Families $\mathcal{T}_1$, $\mathcal{T}_2$, $\mathcal{T}_3$, respectively.



Given an observed sequence of $m$ sites, let $m_0$ be the number of sites where all three
nucleotides are equal, and let $m_i$ ($i = 1, 2, 3$) be the number of sites where the character
in sequence $i$ differs from the state of the other sequences. Then $m = m_0 + m_1 + m_2 + m_3$,
and $f_i = m_i/m$ is the frequency of sites with the corresponding character state
pattern.

Theorem 2. Let $(m_0, m_1, m_2, m_3)$ be the observed data. The ML tree in each family
is obtained at the following point:

- For the family $\mathcal{T}_0$, the likelihood is maximized at $T_0$ with $s_0 = f_0$,
  $s_1 = s_2 = s_3 = (1 - f_0)/3$.

- For the family $\mathcal{T}_1$, the likelihood is maximized at $T_1$ with $s_0 = f_0$, $s_1 = f_1$,
  $s_2 = s_3 = (f_2 + f_3)/2$.

- For the family $\mathcal{T}_2$, the likelihood is maximized at $T_2$ with $s_0 = f_0$, $s_2 = f_2$,
  $s_1 = s_3 = (f_1 + f_3)/2$.

- For the family $\mathcal{T}_3$, the likelihood is maximized at $T_3$ with $s_0 = f_0$, $s_3 = f_3$,
  $s_1 = s_2 = (f_1 + f_2)/2$.
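
The four maximizing points of Theorem 2 are simple functions of the observed counts. A
minimal Python sketch follows; the function names and the dictionary layout (family
index mapped to the tuple (s_0, s_1, s_2, s_3)) are assumptions of the example.

```python
import math


def family_ml_points(m0, m1, m2, m3):
    """Return, for each family index 0..3, the spectrum (s0, s1, s2, s3)
    at which the likelihood is maximized according to Theorem 2."""
    m = m0 + m1 + m2 + m3
    f0, f1, f2, f3 = (x / m for x in (m0, m1, m2, m3))
    third = (1 - f0) / 3
    return {
        0: (f0, third, third, third),
        1: (f0, f1, (f2 + f3) / 2, (f2 + f3) / 2),
        2: (f0, (f1 + f3) / 2, f2, (f1 + f3) / 2),
        3: (f0, (f1 + f2) / 2, (f1 + f2) / 2, f3),
    }


def normalized_log_likelihood(counts, s):
    """Sum of f_i * log(s_i); patterns that were never observed contribute 0."""
    m = sum(counts)
    return sum((c / m) * math.log(p) for c, p in zip(counts, s) if c > 0)


counts = (60, 25, 10, 5)  # hypothetical data: m0, m1, m2, m3
points = family_ml_points(*counts)
```

For these hypothetical counts, `points[1]` is approximately (0.6, 0.25, 0.075, 0.075),
and any other spectrum with s_2 = s_3 that sums to one gives a lower value of
`normalized_log_likelihood`.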

Proof. The log likelihood function equals

$$\ell(m_0, m_1, m_2, m_3 \mid s) = \sum_{i=0}^{3} m_i \log s_i ,$$

and for the normalized function $\hat{\ell} = \ell/m$ we have

$$\hat{\ell}(m_0, m_1, m_2, m_3 \mid s) = \sum_{i=0}^{3} f_i \log s_i .$$

Consider, without loss of generality, the case of the $\mathcal{T}_1$ family. We are interested in
maximizing $\hat{\ell}$ under the constraint $q_2 - q_3 = 0$. By Theorem 1, this implies
$s_2 - s_3 = 0$. Therefore, using Lagrange multipliers, a maximum point of the likelihood
must satisfy

$$\frac{\partial \hat{\ell}}{\partial s_i} = \lambda \frac{\partial (s_2 - s_3)}{\partial s_i} \qquad (i = 1, 2, 3),$$

where $s_0 = 1 - s_1 - s_2 - s_3$ is treated as a function of the other three variables,
implying

$$\frac{f_1}{s_1} = \frac{f_0}{s_0} , \qquad
\frac{f_2}{s_2} = \frac{f_0}{s_0} + \lambda , \qquad
\frac{f_3}{s_3} = \frac{f_0}{s_0} - \lambda .$$

Denote $d = f_0/s_0$, then by adding the last two equations and substituting $s_3 = s_2$ we
have $f_2 + f_3 = 2 d s_2$. Adding the right hand sides and left hand sides of this equality to
these of $f_1 = d s_1$ and $f_0 = d s_0$, we get

$$f_0 + f_1 + f_2 + f_3 = d (s_0 + s_1 + 2 s_2) .$$

Since both $f_0 + f_1 + f_2 + f_3 = 1$ and $s_0 + s_1 + 2 s_2 = 1$, we get $d = 1$. So the ML
point for the family $\mathcal{T}_1$ is attained at the tree $T_1$ with parameters

$$s_0 = f_0 , \quad s_1 = f_1 , \quad s_2 = s_3 = (f_2 + f_3)/2 .$$

We denote by $T_2$, $T_3$, $T_0$ the three corresponding trees that maximize the function $\hat{\ell}$ for
the families $\mathcal{T}_2$, $\mathcal{T}_3$, $\mathcal{T}_0$. The weights of these three trees can be obtained in a similar
fashion to $T_1$. □



Theorem 3. Assume $m_3 \le m_2 \le m_1$. Then the MLmc tree equals $T_1$.

Proof. By Theorem 2, the maximum likelihood tree under the condition that two edges
have the same length is one of the trees $T_1$, $T_2$, or $T_3$. Let

$$G(p) = f_0 \log f_0 + p \log p + (1 - f_0 - p) \log \frac{1 - f_0 - p}{2} .$$

Substituting the values $s_0, s_1, s_2, s_3$ for each tree in the expression

$$\hat{\ell}(m_0, m_1, m_2, m_3 \mid s) = \sum_{i=0}^{3} f_i \log s_i ,$$

and somewhat abusing the notation, we get the following values for the function $\hat{\ell}$ on
the three trees

$$\hat{\ell}(T_1) = G(f_1) , \qquad \hat{\ell}(T_2) = G(f_2) , \qquad \hat{\ell}(T_3) = G(f_3) .$$

The function $G(p)$ behaves similarly to minus the binary entropy function (Gallager,
1968)

$$-H(p) = p \log p + (1 - p) \log(1 - p) .$$

The range where $G(p)$ is defined is $0 \le p \le 1 - f_0$. In this interval, $G(p)$ is negative
and ∪-convex, just like $-H(p)$. So $G$ has a single minimum at the point $p_0$ where its
derivative is zero, $dG(p)/dp = 0$. Solving for $p$ we get $p_0 = (1 - f_0)/3$.

Now $f_3 \le f_2 \le f_1$ and $G(p)$ is ∪-convex. Therefore, out of the three values
$G(f_1)$, $G(f_2)$, $G(f_3)$, the maximum is attained at either $G(f_3)$ or at $G(f_1)$, but not
at $G(f_2)$ (unless $f_2 = f_1$ or $f_2 = f_3$).

Since $f_3 + f_2 + f_1 = 1 - f_0$ and $f_3 \le f_2 \le f_1$, we have $f_3 \le (1 - f_0)/3 \le f_1$,
namely the two “candidates” for ML points are on different sides of the minimum point.
The point $f_3$ is strictly to the left and the point $f_1$ is strictly to the right (except the case
where $f_3 = f_1$ and the two points coincide). If $G(f_1) \ge G(f_3)$, then the tree $T_1$ is the
obvious candidate for MLmc tree. Indeed, $T_1$ satisfies $s_3 = s_2 \le s_1$, so by Theorem 1
$q_3 = q_2 \le q_1$. Thus, a root can be placed on the edge $q_1$ so that the molecular clock
assumption is satisfied.

We certainly could have a case where $G(f_3) > G(f_1)$. However, the tree $T_3$ has
$s_3 < s_1 = s_2$, implying (by Theorem 1) $q_3 < q_1 = q_2$. Therefore there is no way
to place a root on an edge of $T_3$ so as to satisfy a molecular clock. In fact, any tree
with edge lengths $q_3 < q_1 = q_2$ does not satisfy a molecular clock. So the remaining
possibilities could be either the tree $T_0$ (where $s_1 = s_2 = s_3 = (1 - f_0)/3$) or the tree
$T_1$. As $T_0$ attains the minimum over the function $G$, we are always better off taking the
tree $T_1$ (except in the redundant case $f_1 = f_3$, where all these trees collapse to $T_0$). This
completes the proof of Theorem 3. □

The case $m_2 \le m_3 \le m_1$ and its other permutations can clearly be handled similarly.
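
Theorem 3 and its permuted cases amount to a short decision procedure: the taxon whose
singleton pattern is most frequent gets the long pendant edge, and the other two
frequencies are pooled. The Python sketch below illustrates this under stated
assumptions: the function names are hypothetical, ties for the largest count are broken
in favor of the lowest taxon index, and natural logarithms are used.

```python
import math


def _xlog(x, y):
    """x * log(y), with the convention that the term vanishes when x is 0."""
    return 0.0 if x <= 0 else x * math.log(y)


def G(p, f0):
    """G(p) = f0 log f0 + p log p + (1 - f0 - p) log((1 - f0 - p) / 2)."""
    rest = max(0.0, 1.0 - f0 - p)
    return _xlog(f0, f0) + _xlog(p, p) + _xlog(rest, rest / 2)


def mlmc_tree(m0, m1, m2, m3):
    """Return (k, s, value): the MLmc tree is T_k, with spectrum s and
    normalized log likelihood G(f_k)."""
    m = m0 + m1 + m2 + m3
    f = [m0 / m, m1 / m, m2 / m, m3 / m]
    k = max((1, 2, 3), key=lambda i: f[i])
    pooled = (1.0 - f[0] - f[k]) / 2
    s = [f[0]] + [f[i] if i == k else pooled for i in (1, 2, 3)]
    return k, s, G(f[k], f[0])


k, s, value = mlmc_tree(60, 25, 10, 5)  # hypothetical counts
```

For these hypothetical counts the procedure selects `k == 1`, that is, the rooted tree
in which taxa 2 and 3 are sisters and the root lies on the edge leading to taxon 1.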



5 Discussion and Open Problems 

In the case where $G(f_3) > G(f_1)$, $T_1$ is still the MLmc tree. However, if the difference
between the two values is significant, it may give a strong support for rejecting a
molecular clock assumption for the given data $m_0, m_1, m_2, m_3$. This would be the case, for
example, when $0 \approx m_3 \ll m_1 \approx m_2$.

Two natural directions for extending this work are to consider four state characters 
and to extend the number of taxa to n = 4 and beyond. The question of constructing 
rooted trees from rooted triplets is an interesting algorithmic problem, analogous to 
that of constructing unrooted trees from unrooted quartets. The biological relevance of 
triplets based reconstruction methods is also of interest.

Abstract. We study the convergence rates of neighbor-joining and several new
phylogenetic reconstruction methods on families of trees of bounded diameter. 
Our study presents theoretically obtained convergence rates, as well as an empir- 
ical study based upon simulation of evolution on random birth-death trees. We 
find that the new phylogenetic methods offer an advantage over the neighbor- 
joining method, except at low rates of evolution where they have comparable per- 
formance. The improvement in performance of the new methods over neighbor- 
joining increases with the number of taxa and the rate of evolution. 



1 Introduction 

Phylogenetic trees (that is, evolutionary trees) form an important part of biological re- 
search. As such, there are many algorithms for inferring phylogenetic trees. The ma- 
jority of these methods are designed to be used on biomolecular (i.e., DNA, RNA, or 
amino-acid) sequences. Methods for inferring phylogenies from biomolecular sequence 
data are studied (both theoretically and empirically) with respect to the topological ac- 
curacy of the inferred trees. Such studies evaluate the effects of various model con- 
ditions (such as the sequence length, the rates of evolution on the tree, and the tree 
“shape”) on the performance of various methods. 

The sequence length requirement of a method is the sequence length needed by 
the method in order to obtain (with high probability) the true tree topology. Earlier 
studies established analytical upper bounds on the sequence length requirements of 
various methods (including the popular neighbor-joining [ESI method). These studies 
showed that standard methods, such as neighbor-joining, recover the true tree (with high 
probability) from sequences of lengths that are exponential in the evolutionary diameter 
of the true tree. Based upon these studies, in iBEI we defined a parameterization of 
model trees in which the longest and shortest edge lengths are fixed, so that the sequence 
length requirement of a method can be expressed as a function of the number of taxa, n. 
This parameterization leads to the definition of “fast-converging” methods, which are 
methods that recover the true tree from sequences of lengths bounded by a polynomial 
in n once f, the minimum edge length, and g, the maximum edge length, are bounded. 
Several fast-converging methods were developed. We and others analyzed 
the sequence length requirement of standard methods, such as neighbor-joining (NJ), 
under the assumptions that f and g are fixed. These studies showed that neighbor-joining 
and many other methods can be proven to be “exponentially-converging”, that 
is, they recover the true tree with high probability from sequences of lengths bounded 
by a function that grows exponentially in n. So far, none of these standard methods are 
known to be “fast-converging.” 

In this paper, we consider a different parameterization of the model tree space, 
where we fix the evolutionary diameter of the tree, and let the number of taxa vary. 
This parameterization, suggested by John Huelsenbeck [personal communication], al- 
lows us to examine the differential performance of methods with respect to “taxon sam- 
pling” strategies Q. In this case, the shortest edges can be arbitrarily short, forcing the 
method to require unboundedly long sequences in order to recover these shortest edges. 
Hence, the sequence length requirements of all methods cannot be bounded. However, 
for a natural class of model trees, it can be assumed that $f = \Theta(1/n)$ (for example, 
random birth-death trees fall into this class). In this case even very simple polynomial 
time methods converge to the true tree from sequences whose lengths are bounded by 
a polynomial in n. Furthermore, the degrees of the polynomials bounding the conver- 
gence rates of neighbor-joining and the “fast-converging” methods are identical - they 
differ only with respect to the leading constants. Therefore, with respect to this pa- 
rameterization, there is no significant theoretical advantage between standard methods 
and the “fast-converging” methods. We then evaluate two methods, neighbor-joining 
and DCM-NJ+MP (a method introduced in [ED with respect to their performance on 
simulated data, obtained on random birth-death trees with bounded deviation from ul- 
trametricity. We find that DCM-NJ+MP obtains an advantage over neighbor-joining 
throughout most of the parameter space we examine, and is never worse. That advan- 
tage increases as the deviation from ultrametricity increases or as the number of taxa 
increases. 

The rest of the paper is organized as follows. In Section 2 we present the basic 
definitions, models of evolution, methods, and terms, upon which the rest of the paper 
is based. In Section 3 we present the theory behind convergence rate bounds for both 
neighbor-joining and “fast-converging” methods. We derive bounds on the convergence 
rates of various methods for trees in which the evolutionary diameter (but not the shortest 
edge lengths) is fixed. We then derive bounds on the convergence rates of these 
methods for random trees drawn from the distribution on birth-death trees described 
above. In Section 5 we describe our experimental study comparing the performance of 
neighbor-joining and DCM-NJ+MP. In Section 6, we conclude with a discussion and 
open problems. 

2 Basics 

In this section, we present the basic definitions, models of evolution, methods, and 
terms, upon which the rest of the paper is based. 
2.1 Model Trees 

The first step of every simulation study for phylogenetic reconstruction methods is to 
generate model trees. Sequences are then evolved down these trees, and these sequences 
are used, by the methods in question, to estimate the model tree. The accuracy of the 
method is determined by how well the method reproduces the model tree. Model trees 
are often taken from some underlying distribution on all rooted binary trees with n 
leaves. Some possible distributions include the uniform (all binary trees on n leaves are 
equiprobable) and the Yule-Harding distribution (a distribution based upon a model of 
speciation). 

In this paper, we use random birth-death trees with n leaves as our underlying dis- 
tribution. To generate these trees, we view speciation and extinction events occurring 
over a continuous interval. During a short time interval $\Delta t$, a species can split into two 
with probability $b(t)\Delta t$, and a species can become extinct with probability $d(t)\Delta t$. The 
values of $b(t)$ and $d(t)$ depend on how much time has passed in the model. To generate 
a tree with n taxa, we begin this process with a single node and continue until we have a 
tree with n taxa (with some non-zero probability some processes will not produce a tree 
of the desired size since all nodes could go “extinct” before n species are generated; if 
this happens, we repeat the process, until a tree of the desired size is generated). Under 
this distribution, trees have a natural length assigned to each edge, namely the time t 
between the speciation event that began that edge and the event (which could be either 
speciation or extinction) that ended that edge. 
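
The generation procedure described above can be sketched as follows. This is only an illustrative sketch, not the r8s implementation used in the experiments: it assumes constant rates (the text allows $b(t)$ and $d(t)$ to vary with time), and the function name `birth_death_tree`, its parameters, and the array-based tree representation are all hypothetical choices made for the example.

```python
import random

def birth_death_tree(n, birth=1.0, death=0.5, rng=None, max_restarts=10_000):
    """Simulate a constant-rate birth-death process until n lineages are alive.

    Returns (parent, length, alive): parent[v] is the parent of node v (None for
    the root), length[v] is the length of the edge ending at v, and alive lists
    the n surviving lineages. Extinct lineages are kept in the arrays.
    """
    rng = rng or random.Random()
    total_rate = birth + death
    for _attempt in range(max_restarts):
        parent, length, alive = [None], [0.0], [0]
        while 0 < len(alive) < n:
            k = len(alive)
            dt = rng.expovariate(k * total_rate)   # waiting time to the next event
            for v in alive:
                length[v] += dt                    # all living edges grow by dt
            v = alive.pop(rng.randrange(k))        # lineage hit by the event
            if rng.random() < birth / total_rate:  # speciation: two new lineages
                for _child in range(2):
                    parent.append(v)
                    length.append(0.0)
                    alive.append(len(parent) - 1)
            # otherwise v goes extinct and is simply not replaced
        if len(alive) == n:
            return parent, length, alive
        # every lineage went extinct before reaching n taxa: repeat the process
    raise RuntimeError("no tree of the requested size was generated")
```

The restart loop mirrors the remark that a run in which all lineages die out is discarded and the process is repeated.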

Birth-death trees are inherently ultrametric, that is, the branch lengths are propor- 
tional to time. In all of our experiments we modified each edge length to deviate from 
this assumption that sites evolve under the strong molecular clock. To do this, we multi- 
plied each edge by a random number within a range [1/c, c] , where we set c to be some 
small constant. We call this constant the deviation factor. 
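
As a small illustration of the deviation factor, the sketch below rescales every edge by an independent random factor in $[1/c, c]$. The text does not say how the factor is distributed within that range, so drawing it uniformly is an assumption, and the function name and the dictionary representation of edges are hypothetical.

```python
import random

def deviate_from_ultrametricity(edge_lengths, c, rng=None):
    """Multiply each edge length by an independent random factor in [1/c, c].

    edge_lengths maps an edge identifier to its length; c >= 1 is the
    deviation factor. The uniform draw is an assumption of this sketch.
    """
    if c < 1:
        raise ValueError("the deviation factor must be at least 1")
    rng = rng or random.Random()
    return {edge: length * rng.uniform(1.0 / c, c)
            for edge, length in edge_lengths.items()}
```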

2.2 Models of Evolution 

Under the Kimura 2-Parameter (K2P) model ca , each site evolves down the tree under 
the Markov assumption, but there are two different types of nucleotide substitutions: 
transitions and transversions. A transition is a substitution of a purine (an adenine or 
guanine nucleotide) for a purine, or a pyrimidine (a cytosine or thymidine nucleotide) 
for a pyrimidine; a transversion is a substitution of a purine for a pyrimidine or vice 
versa. The probability of a given nucleotide substitution depends on the edge and upon 
the type of substitution. A K2P tree is defined by the triplet $(T, \{\lambda(e)\}, ts/tv)$, where 
$\lambda(e)$ is the expected number of times a random site will change its nucleotide on e, and 
ts/tv is the transition/transversion ratio. In our experiments, we fix this ratio to 2, one 
of the standard settings. 

It is sometimes assumed that the sites evolve identically and independently down the 
tree. However, we can also assume that the sites have different rates of evolution, and 
that these rates are drawn from a known distribution. One popular assumption is that the 
rates are drawn from a gamma distribution with shape parameter $\alpha$, which is the inverse 
of the squared coefficient of variation of the substitution rate. We use $\alpha = 1$ for our experiments 
under K2P+Gamma. With these assumptions, we can specify a K2P+Gamma tree just 
by the pair $(T, \{\lambda(e)\})$. 
2.3 Statistical Performance Issues 

A phylogenetic reconstruction method is statistically consistent under a model of evo- 
lution if for every tree in that model the probability that the method reconstructs the tree 
tends to 1 as the sequence length increases. Under the assumption of a K2P+Gamma 
evolutionary process, if the transition/transversion ratio and shape parameter are known, 
it is possible to define pairwise distances between taxa so that distance-based methods 
(such as neighbor-joining) are statistically consistent m . Real biomolecular sequences 
are of limited length. Therefore, the length k of the sequences affects the performance 
of the method M significantly. The convergence rate of a method M is the rate at which 
it converges to 100% accuracy as a function of the sequence length. 

2.4 Phylogenetic Reconstruction Methods 

We briefly discuss the two phylogenetic methods we use in our empirical studies: 
neighbor-joining and DCM-NJ+MP. Both methods have polynomial running time. 

Neighbor-Joining: Neighbor-joining lll8i is one of the most popular distance based 
methods. Neighbor-joining takes a distance matrix as input and outputs a tree. For ev- 
ery two taxa, it determines a score, based on the distance matrix. At each step, the 
algorithm joins the pair with the minimum score, making a subtree whose root replaces 
the two chosen taxa in the matrix. The distances are recalculated to this new node, and 
the “joining” is repeated until only three nodes remain. These are joined to form an 
unrooted binary tree. 
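
The joining loop described above can be sketched in a few lines. This is a minimal illustration of the standard neighbor-joining criterion, in which the score of a pair is $(m-2)d_{ij} - r_i - r_j$ with $r_i$ the sum of distances from $i$; it is not the implementation used in the experiments. The function name, the `frozenset`-keyed distance dictionary, and the `("internal", k)` labels for new nodes are hypothetical choices.

```python
from itertools import combinations

def neighbor_joining(taxa, dist):
    """Return the edges (parent, child, length) of the neighbor-joining tree.

    taxa is a list of labels; dist maps frozenset({a, b}) to the distance
    between a and b. At least three taxa are required.
    """
    d = dict(dist)
    nodes = list(taxa)
    edges = []
    next_id = 0
    while len(nodes) > 3:
        m = len(nodes)
        # r[a] is the sum of distances from a to every other current node
        r = {a: sum(d[frozenset((a, b))] for b in nodes if b != a) for a in nodes}
        # join the pair with the minimum score
        i, j = min(combinations(nodes, 2),
                   key=lambda p: (m - 2) * d[frozenset(p)] - r[p[0]] - r[p[1]])
        dij = d[frozenset((i, j))]
        u = ("internal", next_id)
        next_id += 1
        len_i = 0.5 * dij + (r[i] - r[j]) / (2 * (m - 2))
        edges.append((u, i, len_i))
        edges.append((u, j, dij - len_i))
        nodes = [k for k in nodes if k not in (i, j)]
        for k in nodes:  # distances from the new node to the remaining nodes
            d[frozenset((u, k))] = 0.5 * (d[frozenset((i, k))]
                                          + d[frozenset((j, k))] - dij)
        nodes.append(u)
    # the last three nodes are joined at one central node
    a, b, c = nodes
    u = ("internal", next_id)
    dab, dac, dbc = d[frozenset((a, b))], d[frozenset((a, c))], d[frozenset((b, c))]
    edges.append((u, a, 0.5 * (dab + dac - dbc)))
    edges.append((u, b, 0.5 * (dab + dbc - dac)))
    edges.append((u, c, 0.5 * (dac + dbc - dab)))
    return edges
```

On an additive distance matrix the recovered edge lengths coincide with those of the generating tree, which makes small hand-built matrices a convenient sanity check.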

DCM-NJ+MP: The DCM-NJ+MP method is a variant of a provably fast-converging 
method that has performed very well in previous studies [HI. In these simulation studies, 
DCM-NJ+MP outperforms, in terms of topological accuracy, the methods DCM*-NJ 
(of which it is a variant) and neighbor-joining. 

The method works as follows: let $d_{ij}$ be the distance between taxa i and j. 

- Phase 1: For each $q \in \{d_{ij}\}$, compute a binary tree $T_q$, by using the Disk-Covering 
Method from [0, followed by a heuristic for refining the resultant tree into a binary 
tree. Let $T = \{T_q : q \in \{d_{ij}\}\}$. (Readers interested in more details of how Phase 1 
is handled should see [|)^.) 

- Phase 2: Select the tree from $T$ which optimizes the parsimony criterion. 

If we consider all $\binom{n}{2}$ thresholds in Phase 1, DCM-NJ+MP takes $O(n^6)$ time. However, 
if we consider only a fixed number p of thresholds, DCM-NJ+MP takes $O(pn^4)$ time. 

2.5 Measures of Accuracy 

There are many ways of measuring error between trees. We use the Robinson-Foulds 
(RF) distance HI which is defined as follows. Every edge e in a leaf-labeled tree T 
defines a bipartition $\pi_e$ on the leaves (induced by the deletion of e), and hence the tree 
T is uniquely encoded by the set $C(T) = \{\pi_e : e \in E(T)\}$, where $E(T)$ is the set of 
all internal edges of T. If T is a model tree and T' is the tree obtained by a phylogenetic 
reconstruction method, then the error in the topology can be calculated as follows: 

- False Positives: $C(T') - C(T)$. 

- False Negatives: $C(T) - C(T')$. 

The RF distance is the average of the false positive and the false 
negative rates. 
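
A minimal sketch of this computation is given below. Trees are written as nested tuples of leaf names, which is a representation chosen for the example. The normalization is an assumption: each rate is taken relative to $n - 3$, the number of internal edges of a binary unrooted tree on $n$ leaves, since the exact formula is not legible in this copy of the text.

```python
def leaf_set(tree):
    """Set of leaf names below a node of a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))

def bipartitions(tree):
    """Nontrivial bipartitions C(T), each stored as the side without a fixed leaf."""
    leaves = leaf_set(tree)
    ref = min(leaves)          # reference leaf used to pick a canonical side
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return
        side = leaf_set(node)
        if 1 < len(side) < len(leaves) - 1:      # internal edges only
            splits.add(side if ref not in side else leaves - side)
        for child in node:
            visit(child)

    visit(tree)
    return splits

def rf_rate(model, inferred):
    """Average of false positive and false negative rates (assumed n - 3 scaling)."""
    c_model, c_inferred = bipartitions(model), bipartitions(inferred)
    n = len(leaf_set(model))
    false_pos = len(c_inferred - c_model)
    false_neg = len(c_model - c_inferred)
    return 0.5 * (false_pos + false_neg) / (n - 3)
```

Storing each bipartition by the side that excludes a fixed reference leaf makes the comparison independent of where the nested-tuple tree happens to be rooted.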



3 Theoretical Results on Convergence Rates 

In m, the sequence length requirement for the neighbor-joining method under the 
Cavender-Farris model was bounded from above, and extended to the General Markov 
model in m . We state the result here; 

Theorem 1. (SMl) Let $(T, M)$ be a model tree in the General Markov model. Let 
$\lambda(e) = -\log |\det(M_e)|$, and set $\lambda_{ij} = \sum_{e \in P_{ij}} \lambda(e)$. 
Assume that f is fixed with $0 < f \le \lambda(e)$ for all edges $e \in T$. Let $\epsilon > 0$ be given. 
Then, there are constants $C$ and $C'$ (that do not depend upon f) such that, for 

$$k = \frac{C \log n \; e^{C' (\max \lambda_{ij})}}{f^2},$$

then with probability at least $1 - \epsilon$, neighbor-joining on S returns the true tree, where S 
is a set of sequences of length k generated on T. The same sequence length requirement 
applies to the Q* method of 



From Theorem 1 we can see that as the edge length gets smaller, the sequence length 
has to be larger in order for neighbor-joining to return the true tree with high probability. 
Note that the diameter of the tree and the sequence length are “exponentially” related. 



3.1 Fixed-Parameter Analyses of the Convergence Rate 

Analysis when both f and g Are Fixed: In l|Sp21 1. the convergence rate of neighbor-joining 
was analyzed when both f and g are fixed (recall that f is the smallest edge 
length, and g is the largest edge length). In this setting, by Theorem 1 and because 
$\max \lambda_{ij} = O(gn)$, we see that neighbor-joining recovers the true tree, with probability 
$1 - \epsilon$, from sequences that grow exponentially in n. An average case analysis of tree 
topologies under various distributions shows that $\max \lambda_{ij} = O(g\sqrt{n})$ for the uniform 
distribution and $O(g \log n)$ for the Yule-Harding distribution. Hence, neighbor-joining 
has an average case convergence rate which is polynomial in n under the Yule-Harding 
distribution, but not under the uniform distribution. 

By definition, “fast-converging” methods are required to converge to the true tree 
from polynomial length sequences, when / and g are fixed. The convergence rates of 
fast-converging methods have a somewhat different form. We show the analysis for the 
DCM*-NJ method (see ED): 
Theorem 2. ( H 1\l ) Let $(T, M)$ be a model tree in the General Markov model. Let 
$\lambda(e) = -\log |\det(M_e)|$, and set $\lambda_{ij} = \sum_{e \in P_{ij}} \lambda(e)$. 
Assume that f is fixed with $0 < f \le \lambda(e)$ for all edges $e \in T$. Let $\epsilon > 0$ be given. 
Then, there are constants $C$ and $C'$ (that do not depend upon f) such that, for 

$$k = \frac{C \log n \; e^{C' (\mathrm{width}(T))}}{f^2},$$

then with probability at least $1 - \epsilon$, DCM*-NJ on S returns the true tree, where S is 
a set of sequences of length k generated on T, and width(T) is a topologically defined 
function which is bounded from above by $\max \lambda_{ij}$ and is also $O(g \log n)$. 

Consequently, fast-converging methods recover the true tree from polynomial length 
sequences when both / and g are fixed. 



Analysis when $\max \lambda_{ij}$ Is Fixed: Suppose now that we fix $\max \lambda_{ij}$ but not f. In this 
case, neither neighbor-joining nor the “fast-converging” methods will recover the true 
tree from sequences whose lengths grow polynomially in n, because as $f \to 0$, the 
sequence length requirement increases without bound. However, for “random” birth-death 
trees, the expected minimum edge length is $\Theta(1/n)$. Hence, suppose that in addition 
to fixing $\max \lambda_{ij}$ we also require that $f = \Theta(1/n)$. In this case, application 
of Theorem 1 and Theorem 2 shows that neighbor-joining and the “fast-converging” 
methods all recover the true tree with high probability from $O(n^2 \log n)$-length sequences. 
The theoretically obtained convergence rates differ only in the leading constant, 
which in neighbor-joining’s case depends exponentially on $\max \lambda_{ij}$, while in the 
case of DCM*-NJ it depends exponentially on width(T). Thus, the performance 
advantage of a fast-converging method, from a theoretical perspective, depends 
upon the difference between these two values. We know that $\mathrm{width}(T) \le \max \lambda_{ij}$ 
for all trees. Furthermore, the two values are essentially equal only when the strong 
molecular clock assumption holds. Note also that when the tree has a low evolutionary 
diameter (i.e., when $\max \lambda_{ij}$ is small), then the predicted performance of these methods 
suggests that they will be approximately identical. Only for large evolutionary diameters 
should we obtain a performance advantage by using the fast-converging methods 
instead of neighbor-joining. 
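
The polynomial bound can be spelled out in one step. The derivation below is a sketch that assumes the sequence length bounds of Theorems 1 and 2 have the form $C \log n \, e^{C' x} / f^2$, with $x = \max \lambda_{ij}$ for neighbor-joining and $x = \mathrm{width}(T)$ for DCM*-NJ; the symbols $D$ and $c$ are introduced here only for the illustration.

```latex
% Assumptions: \max \lambda_{ij} \le D for a fixed constant D, and
% f \ge c/n for some constant c > 0 (which follows from f = \Theta(1/n)).
\[
  k \;=\; \frac{C \log n \; e^{C' \max \lambda_{ij}}}{f^{2}}
    \;\le\; \frac{C \, e^{C' D}}{c^{2}} \; n^{2} \log n
    \;=\; O\!\left(n^{2} \log n\right).
\]
% For DCM*-NJ the same computation applies with width(T) in place of
% \max \lambda_{ij}; since width(T) \le \max \lambda_{ij} \le D, only the
% leading constant e^{C' \, width(T)} changes.
```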

In the next section we discuss the empirical performance of these methods. 



4 Earlier Performance Studies Comparing DCM-NJ+MP to NJ on 
Random Trees 

In an earlier study M, we studied the performance of the neighbor-joining (NJ) 
method, and several new variants of the disk-covering method. The DCM-NJ+MP 
method was one of these new variants we tested. Our experiments (some of which 
we present here) showed that for random trees (from the uniform distribution on binary 
tree topologies) with random branch lengths (also drawn from the uniform distribution 
within some specified range), the DCM-NJ+MP method was a clear improvement 
upon the NJ method with respect to topological accuracy. The DCM-NJ+MP method 
was also more accurate in many of our experiments than the other variants we tested, 
leading us to conclude that the improved performance on random trees might extend to 
other distributions on model trees. 

Later in this paper we will present new experiments, testing this conclusion on ran- 
dom birth-death trees with a moderate deviation from ultrametricity. Here we present a 
small sample of our earlier experiments, which shows the improved performance and 
indicates how DCM-NJ+MP obtains this improved performance. 

Recall that the DCM-NJ+MP method has two phases. In the first phase, a collection 
of trees is obtained, one for each setting of the parameter q. This inference is based upon 
dividing the input set into overlapping subsets, each of diameter bounded from above by 
q. The NJ method is then used on each subset to get a subtree for the subset, and these 
subtrees are merged into a single supertree. These trees are constructed to be binary 
trees, and hence do not need to be further resolved. This first phase is the “DCM-NJ” 
portion of the method. In the second phase, we select a single tree from the collection 
of trees $\{T_q : q \in \{d_{ij}\}\}$, by selecting the tree which has the optimal parsimony score 
(i.e., the fewest changes on the tree). 
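
The second phase can be illustrated with Fitch's algorithm, a standard way to count the minimum number of changes on a fixed binary tree. This is a sketch only: the experiments reported here used PAUP* to compute parsimony scores, and the function names and the nested-tuple tree representation below are hypothetical. The trees are rooted arbitrarily, which does not affect the score.

```python
def fitch_score(tree, seqs):
    """Minimum number of changes on a binary tree (Fitch's algorithm).

    tree is a nested 2-tuple of leaf names; seqs maps each leaf name to an
    aligned sequence (all sequences have the same length).
    """
    n_sites = len(next(iter(seqs.values())))
    total = 0
    for site in range(n_sites):
        def state_set(node):
            nonlocal total
            if isinstance(node, str):
                return {seqs[node][site]}
            left, right = (state_set(child) for child in node)
            common = left & right
            if common:
                return common
            total += 1            # disjoint child sets force one change
            return left | right
        state_set(tree)
    return total

def select_by_parsimony(trees, seqs):
    """Phase 2: return the candidate tree with the fewest changes."""
    return min(trees, key=lambda t: fitch_score(t, seqs))
```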

The accuracy of this two-phase method depends upon two properties: first, the first 
phase must produce a set of trees so that at least some of these trees are better than 
the NJ tree, and second, the technique (in our case, maximum parsimony) used in the 
second phase must be capable of selecting a better tree than the NJ tree. Thus, the 
first property depends upon the DCM-NJ method providing an improvement, and the 
second property depends upon the performance of the maximum parsimony criterion as 
a technique for selecting from the set {Tq}. In the following figures we show that both 
properties hold for random trees under the uniform distribution on tree topologies and 
branch lengths. 

In Figure 1 we show the results of an experiment in which we scored each of the 
different trees Tq for topological accuracy. This experiment is based upon random trees 
from the uniform distribution. Note that the best trees are significantly better than the NJ 
tree. Thus, the DCM-NJ method itself is providing an advantage over the NJ method. 

In Figure 2 we show the result of a similar experiment in which we compared several 
different techniques for the second phase (i.e., for selecting a tree from the set {Tq}). 
This figure shows that the Maximum Parsimony (MP) technique obtains better trees 
than the Short Quartet Support Method, which is the technique used in the second phase 
of the DCM*-NJ method. Furthermore, both DCM-NJ+MP and DCM*-NJ improve 
upon NJ, and this improvement increases with the number of taxa. 

Thus, for random trees from the uniform distribution on tree topologies and branch 
lengths, DCM-NJ+MP improves upon NJ, and this improvement is due to both the 
decomposition strategy used in Phase 1, and the selection criterion used in Phase 2. 

Note however that DCM-NJ+MP is not statistically consistent, even under the simplest 
models, since the maximum parsimony criterion can select the wrong tree with 
probability going to 1 as the sequence length increases. 
Fig. 1. The accuracy of the $T_q$’s for different values of q (the threshold, x-axis) on a 
randomly generated tree with 100 taxa, sequence length 1000, and an average branch length of 0.05. 




Fig. 2. DCM-NJ+MP vs. DCM*-NJ vs. NJ on random trees (uniform distribution on 
tree topologies and branch lengths) with sequence evolution under the K2P+Gamma 
model. Sequence length is 1000. Average branch length is 0.05. 
5 New Performance Studies under Birth-Death Trees 

5.1 Introduction 

In this paper we focused upon the question of whether the improvement in performance 
over NJ that we saw in DCM-NJ+MP was a function of the distribution on tree topolo- 
gies and branch lengths (both uniform), or whether we would continue to see an im- 
provement in performance, by comparison to NJ, when we restrict our attention to a 
more biologically based distribution on model trees. Hence we focus on random birth- 
death trees, with some deviation from ultrametricity added (so that the strong molecular 
clock does not hold). As we will show, the improvement in performance is still visible, 
and our earlier claims extend to this case. 

5.2 Experimental Platform 

Machines: The experiments were run on the SCOUT cluster at University of Texas, 
which contains approximately 130 different processors running the Debian Linux oper- 
ating system. We also had nighttime use of approximately 150 Pentium III processors 
located in public undergraduate laboratories. 

Software: We used Sanderson’s r8s package for generating birth-death trees [d 
and the program Seq-Gen C3 to randomly generate a DNA sequence for the root and 
evolve it through the tree under the K2P+Gamma model of evolution. We calculated 
evolutionary distances appropriately for the model (see El). In the presence of saturation 
(that is, datasets in which some distances could not be calculated because the formula 
did not apply), we used the “fix-factor 1” technique, as defined in [ 0 . In this technique, 
the distances that cannot be set using the standard technique are all assigned the largest 
corrected distance in the matrix. 
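
The saturation fix can be sketched as follows. As a stand-in for the model-appropriate distance, the example uses the plain Kimura 2-parameter correction $d = -\tfrac{1}{2}\ln(1 - 2P - Q) - \tfrac{1}{4}\ln(1 - 2Q)$, where $P$ and $Q$ are the observed proportions of transitions and transversions; the gamma-corrected distance used in the experiments is not reproduced here. The function names are hypothetical, and `None` marks a distance for which the formula does not apply.

```python
import math

PURINES = {"A", "G"}

def k2p_distance(s, t):
    """Plain K2P distance between two aligned DNA sequences, or None if saturated."""
    n = len(s)
    transitions = transversions = 0
    for x, y in zip(s, t):
        if x == y:
            continue
        if (x in PURINES) == (y in PURINES):
            transitions += 1      # purine<->purine or pyrimidine<->pyrimidine
        else:
            transversions += 1
    p, q = transitions / n, transversions / n
    a, b = 1 - 2 * p - q, 1 - 2 * q
    if a <= 0 or b <= 0:
        return None               # logarithm undefined: the pair is saturated
    return -0.5 * math.log(a) - 0.25 * math.log(b)

def fix_factor_1(matrix):
    """Assign the largest defined distance to every undefined entry."""
    largest = max(d for row in matrix for d in row if d is not None)
    return [[largest if d is None else d for d in row] for row in matrix]
```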

The software for DCM-NJ was written by Daniel Huson. To calculate the maximum 
parsimony scores of the trees we used PAUP* 4.0 m. For job management across the 
cluster and public laboratory machines, we used the Condor software package [tZ(®. We 
generated the rest of this software (a combination of C++ programs and Perl scripts) 
explicitly for these experiments. 

5.3 Bounded Diameter Trees 

We performed experiments on bounded diameter trees, and observed how the error rates 
increase as the number of taxa increases. The birth-death trees that we generated using 
r8s have diameter 2. In order to obtain trees with other diameters, we multiplied the 
edge lengths by factors of 0.01, 0.1, and 0.5, thus obtaining trees of diameters 0.02, 0.2, 
and 1.0, respectively. Then, to deviate these trees from ultrametricity, we modified the 
edge lengths using deviation factor 4. The resulting trees have diameters bounded from 
above by 4 times the original diameter, but have expected diameters of approximately 
twice the original diameters. Thus, the final model trees have expected diameters that 
are 0.04, 0.4, and 2.0. In this way we generated random model trees with 10, 25, 50, 
100, 200, 400, and 800 leaves. For each number of taxa and diameter, we generated 30 
random birth-death trees (using r8s). 
5.4 Experimental Design 

For each model tree we generated sequences of length 500 using Seq-Gen, and computed 
trees using NJ and DCM-NJ+MP. We then computed the Robinson-Foulds error rate 
for each of the inferred trees, by comparing it to the model tree that generated the data. 

5.5 Results and Discussion 

In order to obtain statistically robust results, we followed the advice of McGeoch ira 
and Moret O and used a number of runs, each composed of a number of trials (a 
trial is a single comparison), and computed the mean and standard deviation over the runs 
of these events. This approach is preferable to using the same total number of samples 
in a single run, because each of the runs is an independent pseudorandom stream. With 
this method, one can obtain estimates of the mean that are closely clustered around the 
true value, even if the pseudorandom generator is not perfect. 
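
The runs-and-trials scheme can be sketched as below. The sketch assumes that each run is given its own seeded generator and that a trial is any function returning a number (for instance an RF rate); the names `run_mean` and `summarize` and the seeding scheme are hypothetical and are not taken from the authors' scripts.

```python
import random
import statistics

def run_mean(trial, n_trials, seed):
    """Mean outcome of one run, using an independent pseudorandom stream."""
    rng = random.Random(seed)
    return statistics.mean(trial(rng) for _ in range(n_trials))

def summarize(trial, n_runs, n_trials, base_seed=0):
    """Mean and standard deviation of the per-run means (n_runs >= 2)."""
    means = [run_mean(trial, n_trials, base_seed + r) for r in range(n_runs)]
    return statistics.mean(means), statistics.stdev(means)
```

A call such as `summarize(lambda rng: rng.random(), n_runs=10, n_trials=30)` returns the average of the ten run means together with their standard deviation.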

The standard deviation of the mean outcomes in our studies varied depending on 
the number of taxa. The standard deviation of the mean on 10-taxon trees is 0.2 (which 
is 20 percent, since the possible values of the outcomes range from 0 to 1), on 25-taxon 
trees is 0.1 (which is 10 percent), whereas on 200, 400 and 800-taxon trees the standard 
deviation ranged from 0.02 to 0.04 (which is between 2 and 4 percent). We graph the 
average of the mean outcomes for the runs, but omit the standard deviations from the 
graphs. 

In Figure 3, we show how neighbor-joining and DCM-NJ+MP are affected by in- 
creasing the rate of evolution (i.e., the height). The x-axis is the maximum expected 
number of changes of a random site across the tree, and the y-axis is the RF rate. We 
provide a curve for each number of taxa we explored, from 10 up to 800. The sequence 
length is fixed in this experiment to 500. Note that both neighbor-joining and DCM- 
NJ+MP have high errors for the lowest rates of evolution, and that at these low rates 
of evolution the error rates increase as n increases. This is because for these low rates 
of evolution, increasing the number of taxa makes the smallest edge length (i.e., f) 
decrease, and thus increases the sequence length needed to have enough changes on 
the short edges for them to be recoverable. As the rate of evolution increases, the error 
rates initially decrease for both methods, but eventually the error rates begin to increase 
again. This increase in error occurs where the exponential portion of the convergence 
rate (i.e., where the sequence length depends exponentially on $\max \lambda_{ij}$) becomes 
significant. Note that where this happens is essentially the same for both methods, and 
that they perform equally well until that point. However, after this point, neighbor- 
joining’s performance is worse, compared to DCM-NJ+MP; furthermore, the error rate 
increases for neighbor-joining at each of the “large” diameters, as n increases, while 
DCM-NJ+MP’s error rate does not reflect the number of taxa nearly as much. 

In Figure 4, we present a different way of looking at the data. In this figure, the 
x-axis is the number of taxa, the y-axis is the RF rate, and there is a curve for each 
of the methods. We show thus how increasing n (the number of taxa) while fixing the 
diameter of the tree affects the accuracy of the trees reconstructed. Note that at low rates 
of evolution (the left figure), the error rates for both methods increase with the number 
of taxa. At moderate rates of evolution (the middle figure), error rates increase for both 
methods but more so for neighbor-joining than for DCM-NJ+MP. Finally, at the higher 
rate of evolution (the right figure), this trend continues, but the gap is even larger: in 
fact, DCM-NJ+MP’s error increase looks almost flat. 

These experiments suggest strongly that except for low diameter situations, the 
DCM-NJ+MP method (and probably the other “fast-converging” methods) will outperform 
the neighbor-joining method, especially for large numbers of taxa and high 
evolutionary rates. 




Fig. 3. NJ (left graph) and DCM-NJ+MP (right graph) error rates on random birth-death 
trees as the diameter (x-axis) grows. Sequence length fixed at 500, and deviation factor 
fixed at 4. 



Table 1 shows the average running times of neighbor-joining and DCM-NJ+MP on 
the trees that we used in the experiments. The DCM-NJ+MP version that we ran looked 
at 10 thresholds in Phase 1 instead of looking at all the $\binom{n}{2}$ thresholds. 



Fig. 4. NJ and DCM-NJ+MP: Error rates on random birth-death trees as the number 
of taxa (x-axis) grows. Sequence length fixed at 500 and the deviation factor at 4. The 
expected diameters of the resultant trees are 0.02 (for the left graph), 0.2 (for the middle 
graph), and 1.0 (for the right graph). 

Table 1. The Running Times of NJ and DCM-NJ+MP in Seconds. 

Taxa    NJ        DCM-NJ+MP
10      0.01      1.94
25      0.02      9.12
50      0.06      24.99
100     0.35      132.46
200     2.5       653.27
400     20.08     4991.11
800     160.4     62279.3



6 Conclusion 

In an earlier study we presented the DCM-NJ+MP method and showed that it outperformed 
the NJ method for random trees drawn from the uniform distribution on tree 
topologies and branch lengths. In this study we show that this improvement extends to 
the case where the trees are drawn from a more biologically realistic distribution, in 
which the trees are birth-death trees with a moderate deviation from ultrametricity. This 
study has consequences for large phylogenetic analyses, because it shows that the accuracy 
of the NJ method may suffer significantly on large datasets. Furthermore, since the 
DCM-NJ+MP method has good accuracy, even on large datasets, our study suggests 
that other polynomial time methods may be able to handle the large dataset problem 
without significant error. 
Abstract. Gu et al. gave a 2-approximation for computing the minimal number 
of inversions and transpositions needed to sort a permutation. There is evidence 
that, from the point of view of computational molecular biology, a more ade- 
quate objective function is obtained, if transpositions are given double weight. 
We present a $(1 + \epsilon)$-approximation for this problem, based on the exact algorithm 
of Hannenhalli and Pevzner, for sorting by reversals only. 



1 Introduction 

This paper is concerned with the problem of sorting permutations using long range 
operations like inversions (reversing a segment) and transpositions (moving a segment). 
The problem comes from computational molecular biology, where the aim is to find a 
parsimonious rearrangement scenario that explains the difference in gene order between 
two genomes. In the late eighties, Palmer and Herbon [0 found that the number of such 
operations needed to transform the gene order of one genome into the other could be 
used as a measure of the evolutionary distance between two species. 

The kinds of operations we consider are inversions, transpositions and inverted
transpositions. Hannenhalli and Pevzner showed that the problem of finding the
minimal number of inversions needed to sort a signed permutation is solvable in
polynomial time, and an improved algorithm was subsequently given by Kaplan et al.
Caprara, on the other hand, showed that the corresponding problem for unsigned
permutations is NP-hard. For transpositions no such sharp results are known, but the
(3/2)-approximation algorithms of Bafna and Pevzner and of Christie are worth
mentioning.

Moving on to the combined problem, Gu et al. gave a 2-approximation algorithm
for the minimal number of operations needed to sort a signed permutation by
inversions, transpositions and inverted transpositions. However, an algorithm looking
for the minimal number of operations will produce a solution heavily biased towards
transpositions. Instead, we propose the following problem: find the $\pi$-sorting scenario
$s$ (i.e., transforming $\pi$ to the identity) that minimizes $\mathrm{inv}(s) + 2\,\mathrm{trp}(s)$, where
$\mathrm{inv}(s)$ and $\mathrm{trp}(s)$ are the numbers of inversions and transpositions in $s$, respectively.

We give a closed formula for this minimal weighted distance. Our formula is
similar to the exact formula for the inversion case, given by Hannenhalli and Pevzner.



O. Gascuel and B.M.E. Moret (Eds.): WABI 2001, LNCS 2149, pp. 22.7- I7T71 2001. 
We also show how to obtain a polynomial time algorithm for computing this formula
with an accuracy of $(1+\varepsilon)$, for any $\varepsilon > 0$. As an example, we explicitly state a
7/6-approximation. We also argue that for most applications the algorithm performs much
better than guaranteed.



2 Preliminaries 

Here we present some useful definitions from Bafna and Pevzner and from Hannenhalli
and Pevzner, as well as a couple of new ones.

In this paper, we work with signed, circular permutations. We adopt the
convention of reading the circular permutations counterclockwise. We will sometimes
linearize the permutation by inverting both signs and reading direction if it contains $-1$,
then making a cut in front of 1 and finally adding $n+1$ last, where $n$ is the length of the
permutation. An example is shown in Figure 1. A breakpoint in a permutation is a pair

[Figure: a circular permutation and its linearized form 1 -6 4 -5 -3 -2 7.]

Fig. 1. Transforming a Circular Permutation to a Linear Form.



of adjacent genes that are not adjacent in a given reference permutation. For instance,
if we compare a genome to the identity permutation and consider the linearized version
of the permutation, the pair $(\pi_i, \pi_{i+1})$ is a breakpoint if and only if $\pi_{i+1} - \pi_i \neq 1$. For
unsigned permutations, this would be written $|\pi_{i+1} - \pi_i| \neq 1$.
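
The linearization step and the breakpoint test can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not code from the paper: the names `linearize` and `count_breakpoints` are hypothetical, and the circular permutation is assumed to be given as a tuple read counterclockwise from an arbitrary starting point.

```python
def linearize(circular):
    """Linearize a signed circular permutation as described in the text."""
    p = list(circular)
    # If -1 occurs, invert all signs and the reading direction.
    if -1 in p:
        p = [-x for x in reversed(p)]
    # Cut in front of 1 and append n + 1 as the last element.
    k = p.index(1)
    return tuple(p[k:] + p[:k] + [len(p) + 1])


def count_breakpoints(linear):
    """Count pairs (pi_i, pi_{i+1}) with pi_{i+1} - pi_i != 1."""
    return sum(1 for a, b in zip(linear, linear[1:]) if b - a != 1)


example = linearize((2, 3, 5, -4, 6, -1))
print(example, count_breakpoints(example))
```

For the input above the linearized form is 1 -6 4 -5 -3 -2 7, in which only the pair (-3, -2) is not a breakpoint.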

The three operations we consider are inversions, transpositions and inverted
transpositions. These are defined in Figure 2. Following the works cited above, we transform a signed, circular





Fig. 2. Definitions of inversion, transposition and inverted transposition on signed 
genomes. If we remove all signs, the definition holds for unsigned genomes. 
permutation $\pi$ on $n$ elements to an unsigned, circular permutation $\pi'$ on $2n$ elements as
follows. Replace each element $x$ in $\pi$ by the pair $(2x-1, 2x)$ if $x$ is positive and by the
pair $(-2x, -2x-1)$ otherwise. An example can be viewed in Figure 3. Then, to each
operation in $\pi$ there is a corresponding operation in $\pi'$, where the cuts are placed after
even positions. We also see that the number of breakpoints in $\pi$ equals the number of
breakpoints in $\pi'$. Define the breakpoint graph on $\pi'$ by adding a black edge between



Fig. 3. Transforming a signed permutation of length 8 to an unsigned permutation of
length 16.



$\pi'_i$ and $\pi'_{i+1}$ if there is a breakpoint between them and a grey edge between $2i$ and $2i+1$,
unless these are adjacent (Figure 4). These edges will then form alternating cycles. The
length of a cycle is the number of black edges in it. Sometimes we will also draw black
and grey edges between $2i$ and $2i+1$, even though these are adjacent. We will then
get a cycle of length one at the places where we do not have a breakpoint in $\pi$ (these
cycles will be referred to as short cycles). A cycle is oriented if, when we traverse it,
at least one black edge is traversed clockwise and at least one black edge is traversed
counterclockwise. Otherwise, the cycle is unoriented.
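
The construction of $\pi'$ and of the alternating cycles can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes that the circular permutation is passed as a tuple, that black edges join the two neighbouring elements at every cut between consecutive genes (so adjacencies yield short cycles of length one), and that the grey edge leaving $2n$ wraps around to 1. The names `to_unsigned` and `cycle_lengths` are hypothetical.

```python
def to_unsigned(pi):
    """Replace x by (2x-1, 2x) if x > 0 and by (-2x, -2x-1) otherwise."""
    out = []
    for x in pi:
        out += [2 * x - 1, 2 * x] if x > 0 else [-2 * x, -2 * x - 1]
    return out


def cycle_lengths(pi):
    """Lengths (number of black edges) of all alternating cycles, short ones included."""
    n = len(pi)
    p = to_unsigned(pi)
    black, grey = {}, {}
    # Black edges: neighbours across each cut between consecutive genes (circular).
    for k in range(n):
        a, b = p[2 * k + 1], p[(2 * k + 2) % (2 * n)]
        black[a], black[b] = b, a
    # Grey edges: between 2i and 2i+1, where 2n+1 is identified with 1.
    for i in range(1, n + 1):
        a, b = 2 * i, (2 * i + 1 if i < n else 1)
        grey[a], grey[b] = b, a
    seen, lengths = set(), []
    for start in p:
        if start in seen:
            continue
        v, length = start, 0
        while v not in seen:
            seen.add(v)
            w = black[v]          # follow a black edge ...
            seen.add(w)
            length += 1
            v = grey[w]           # ... then a grey edge
        lengths.append(length)
    return sorted(lengths)


print(cycle_lengths((1, 2, 3)))   # identity: only short cycles
print(cycle_lengths((1, 3, 2)))   # a single cycle of length three
```

With these conventions, $n$ minus the number of cycles returned is the quantity $b - c$ that the text later uses as the size.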

Consider two cycles $c_1$ and $c_2$. If we cannot draw a straight line through the circle
such that the elements of $c_1$ are on one side of the line and the elements of $c_2$ are
on the other side, then these two cycles are inseparable. This relation is extended to
an equivalence relation by saying that $c_1$ and $c_j$ are in the same component if there
is a sequence of cycles $c_1, c_2, \ldots, c_j$ such that, for all $1 \le i \le j-1$, $c_i$ and $c_{i+1}$ are
inseparable.

A component is oriented if at least one of its cycles is oriented and unoriented 
otherwise. If there is an interval on the circle, which contains an unoriented component, 
but no other unoriented components, then this component is a hurdle. If we cannot 
remove a hurdle without creating a second hurdle upon its removal (this is the case if 
there is an unoriented component, which is not a hurdle, that stretches over an interval 
that contains the previously mentioned hurdle, but no other hurdles), then the hurdle is 
called a super hurdle. If we have an odd number of super hurdles and no other hurdles, 
the permutation is known as a fortress. 
Fig. 4. The breakpoint graph of a transformed permutation. It contains three cycles, two
of length 2 and one of length 3. The latter constitutes one component, and the first
two constitute another component. The first component is unoriented and the second is
oriented. Both components have size 2.

We should observe that for components in the breakpoint graph, the operations
needed to remove them do not depend on the actual numbers on the vertices. We could
therefore treat the components as separate objects, disregarding the particular
permutation they are part of. If we wish, we can also regard the components as permutations,
by identifying them with (one of) the shortest permutations whose breakpoint graphs
consist of this component only. For example, the 2-cycle component in Figure 4 can be
identified with the permutation 1 -3 -4 2 5.

We say that an unoriented component is of odd length if all cycles in the component
are of odd length. We let $b(\pi)$, $c(\pi)$ and $h(\pi)$ denote the number of breakpoints, cycles
(not counting cycles of length one) and hurdles in the breakpoint graph of a permutation
$\pi$, respectively. For components $t$, $b(t)$ and $c(t)$ are defined similarly. The size $s$ of a
component $t$ is given by $b(t) - c(t)$. We also let $c_s(\pi)$ denote the number of cycles
in $\pi$, including the short ones, and $f(\pi)$ is a function that is 1 if $\pi$ is a fortress, and 0
otherwise.

3 Expanding the Inversion Formula 

Let $S_I$ denote the set of all scenarios transforming $\pi$ into id using inversions only and
let $\mathrm{inv}(s)$ denote the number of inversions in a scenario $s$. The inversion distance is
defined as $d_{inv}(\pi) = \min_{s \in S_I} \{\mathrm{inv}(s)\}$. It has been shown (see Hannenhalli and Pevzner) that

\[ d_{inv}(\pi) = b(\pi) - c(\pi) + h(\pi) + f(\pi), \]

where $b(\pi)$, $c(\pi)$, $h(\pi)$ and $f(\pi)$ have been defined in the previous paragraph. In this
paper, we define the distance between $\pi$ and id by

\[ d(\pi) = \min_{s \in S} \{\mathrm{inv}(s) + 2\,\mathrm{trp}(s)\}, \]

where $S$ is the set of all scenarios transforming $\pi$ into id, allowing both inversions
and transpositions, and $\mathrm{inv}(s)$ and $\mathrm{trp}(s)$ are the numbers of inversions and
transpositions in scenario $s$, respectively. Here, transpositions refer to both ordinary and inverted
transpositions.
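
For very small permutations the weighted distance can be computed by exhaustive search, which is handy for checking formulas by hand. The sketch below runs Dijkstra's algorithm over signed permutations with inversions costing 1 and ordinary transpositions costing 2. It is an illustration under explicit assumptions: it works on linear rather than circular permutations, it leaves out inverted transpositions, and the names `neighbours` and `weighted_distance` are hypothetical. Its running time is exponential in the permutation length.

```python
import heapq
from itertools import combinations


def neighbours(p):
    """Yield (permutation, cost) pairs reachable from p by one operation."""
    n = len(p)
    # Inversion of the segment p[i:j]: reverse it and flip all signs (cost 1).
    for i, j in combinations(range(n + 1), 2):
        yield p[:i] + tuple(-x for x in reversed(p[i:j])) + p[j:], 1
    # Transposition: exchange the adjacent segments p[i:j] and p[j:k] (cost 2).
    for i, j, k in combinations(range(n + 1), 3):
        yield p[:i] + p[j:k] + p[i:j] + p[k:], 2


def weighted_distance(p):
    """Minimum of inv(s) + 2 trp(s) over all scenarios sorting p (brute force)."""
    start = tuple(p)
    target = tuple(range(1, len(start) + 1))
    dist = {start: 0}
    heap = [(0, start)]
    while heap:
        d, q = heapq.heappop(heap)
        if q == target:
            return d
        if d > dist[q]:
            continue  # stale heap entry
        for r, cost in neighbours(q):
            if d + cost < dist.get(r, float("inf")):
                dist[r] = d + cost
                heapq.heappush(heap, (d + cost, r))


print(weighted_distance((1, -2, 3)), weighted_distance((2, 1)))
```

In the second call a single transposition (weight 2) is cheaper than the three inversions that would otherwise be needed.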

In order to give a formula for this distance, we need a few definitions. 

Definition 1. Regard all components $t$ as permutations and let $d(t)$ be the distance
between $t$ and id as defined above for permutations. Consider the set $S$ of components
$t$ such that $d(t) > b(t) - c(t)$ (when using inversions only, this is the set of unoriented
components). We call this set the set of strongly unoriented components. If there is
an interval on the circle that contains the component $t \in S$, but no other member of $S$,
then $t$ is a strong hurdle. Strong super hurdles and strong fortresses are defined in
the same way as super hurdles and fortresses (just replace hurdle with strong hurdle).



Observation 1. In any scenario, each inverted transposition can be replaced by two
inversions, without affecting the objective function. This means that in calculating $d(\pi)$,
we need not bother with inverted transpositions. Therefore, we will henceforth consider
only inversions and ordinary transpositions.
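
The replacement in Observation 1 can be checked mechanically on a small example. The sketch below assumes that an inverted transposition turns two adjacent segments A B into -B A, where -B is B reversed with all signs flipped (the figure defining the operation is not reproduced here, so this reading is an assumption). The function names are hypothetical.

```python
from itertools import combinations


def invert(p, i, j):
    """Inversion of the segment p[i:j]: reverse it and flip all signs."""
    return p[:i] + tuple(-x for x in reversed(p[i:j])) + p[j:]


def inverted_transposition(p, i, j, k):
    """Turn A B into -B A, with A = p[i:j] and B = p[j:k]."""
    return p[:i] + tuple(-x for x in reversed(p[j:k])) + p[i:j] + p[k:]


def as_two_inversions(p, i, j, k):
    """Same result with two inversions: A B -> -B -A -> -B A."""
    q = invert(p, i, k)                 # A B becomes -B -A
    return invert(q, i + (k - j), k)    # flip -A back to A


p = (3, -1, 4, 2, -5)
same = all(
    inverted_transposition(p, i, j, k) == as_two_inversions(p, i, j, k)
    for i, j, k in combinations(range(len(p) + 1), 3)
)
print(same)
```

Both variants cost 2 in the objective function, which is why the replacement leaves it unchanged.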



Lemma 1. Each strongly unoriented component is unoriented (in the inversion sense).

Proof. We know that for oriented components $t$, $d_{inv}(t) = b(t) - c(t)$ and for any
permutation $\pi$, we have $d(\pi) \le d_{inv}(\pi)$. Regarding the component $t$ as a permutation
gives $d(t) \le d_{inv}(t)$. Thus, for strongly unoriented components we have $d_{inv}(t) \ge
d(t) > b(t) - c(t)$ and we can conclude that a strongly unoriented component cannot
be oriented.



Theorem 2. The distance $d(\pi)$ defined above is given by

\[ d(\pi) = b(\pi) - c(\pi) + h_t(\pi) + f_t(\pi), \]

or, equivalently (counting short cycles as well),

\[ d(\pi) = n - c_s(\pi) + h_t(\pi) + f_t(\pi), \]

where $h_t(\pi)$ is the number of strong hurdles in $\pi$, $f_t(\pi)$ is 1 if $\pi$ is a strong fortress
(and 0 otherwise) and $n$ is the length of the permutation $\pi$.

Proof. It is easy to see that $d(\pi) \le n - c_s(\pi) + h_t(\pi) + f_t(\pi)$. If we treat the strong
hurdles as in the inversion case, we need only $h_t(\pi) + f_t(\pi)$ inversions to make all strongly
unoriented components oriented. All oriented components can be removed efficiently
using inversions, and the unoriented components which are not strongly unoriented can,
by definition, be removed efficiently.

We now need to show that we cannot do better than the formula above. From
Hannenhalli and Pevzner we know that we cannot decrease $n - c_s(\pi)$ by more than
1 using an inversion. Similarly, a transposition will never decrease $n - c_s(\pi)$ by more
than 2, which is obtained by splitting a cycle into three cycles. The question is whether
transpositions can help us to remove strong hurdles more efficiently than inversions.

Bafna and Pevzner have shown that applying a transposition can only change the
number of cycles by 0 or $\pm 2$. There are thus three possible ways of applying a
transposition. First, we can split a cycle into three parts ($\Delta c_s = 2$). If we do this to a strong
hurdle, at least one of the components we get must by definition remain a strong
hurdle, since otherwise the original component could be removed efficiently. This gives
$\Delta h_t = 0$. Second, we can let the transposition cut two cycles ($\Delta c_s = 0$). To decrease
the distance by three, we would have to decrease the number of strong hurdles by three,
which is clearly out of reach (only two strong hurdles may be affected by a
transposition on two cycles). Finally, if we merge three cycles ($\Delta c_s = -2$), we would need to
remove five strong hurdles. This clearly is impossible.

It is conceivable that the fortress property could be removed by a transposition that
reduces $n - c_s(\pi) + h_t(\pi)$ by two and at the same time removes an odd number of
strong super hurdles or adds a strong hurdle that is not a strong super hurdle. However,
from the analysis above, we know that the transpositions that decrease $n - c_s(\pi) + h_t(\pi)$
by two must decrease $h_t(\pi)$ by an even number. We also found that when this was
achieved, no other hurdles apart from those removed were affected. Hence, there are no
transpositions that reduce $n - c_s(\pi) + h_t(\pi) + f_t(\pi)$ by three.

We find that $d(\pi) \ge n - c_s(\pi) + h_t(\pi) + f_t(\pi)$, and in combination with the first
inequality, $d(\pi) = n - c_s(\pi) + h_t(\pi) + f_t(\pi)$.



Fig. 5. The breakpoint graph of a cycle of length three which can be removed by a single
transposition.
3.1 The Strong Hurdles

Once we have identified all strongly unoriented components in a breakpoint graph, we
are able to calculate the number of strong hurdles. We thus need to look into the question
of determining which components are strongly unoriented.

From the lemma above, we found that all strongly unoriented components are
unoriented. The converse is not true. One example of this is the unoriented cycle in Figure 5,
which can be removed with a single transposition. However, many unoriented
components are also strongly unoriented. Most of them are characterized by the following
lemma.

Lemma 2. If an unoriented component contains a cycle of even length, then it is
strongly unoriented.

Proof. Since the component is unoriented, applying an inversion to it will not increase
the number of cycles. If we apply a transposition to it, it will remain unoriented. Thus,
the only way to remove it efficiently would be to apply a series of transpositions, all
increasing the number of cycles by two.

Consider what happens if we split a cycle of even length into three cycles. The sum
of the lengths of these three new cycles must equal the length of the original cycle; in
particular it must be even. Three odd numbers never add to an even number, so we must
still have at least one cycle of even length, which is shorter than the original cycle.

Eventually, the component must contain a cycle of length 2. There are no
transpositions reducing $b(t) - c(t)$ by 2 that can be applied to this cycle, and hence the component
is strongly unoriented.
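
The parity step of the proof is plain arithmetic and can be confirmed by enumeration. The sketch below lists every way of writing a cycle length as an ordered sum of three positive lengths and checks whether an even part always survives. The helper names are hypothetical, and the check says nothing about which splits are realizable by an actual transposition.

```python
def splits_into_three(length):
    """All ordered triples of positive integers summing to `length`."""
    return [
        (a, b, length - a - b)
        for a in range(1, length - 1)
        for b in range(1, length - a)
    ]


def always_leaves_even_cycle(length):
    """True if every split of `length` into three parts contains an even part."""
    return all(
        any(part % 2 == 0 for part in parts)
        for parts in splits_into_three(length)
    )


print([length for length in range(3, 13) if always_leaves_even_cycle(length)])
```

Only the even lengths pass the check, whereas an odd length such as 3 = 1 + 1 + 1 can be split into odd parts only.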

Concentrating on the unoriented components with cycles of odd lengths only, we find
that some of these are strongly unoriented and some are not. For instance, there are two
unoriented cycles of length three. One of them is the cycle in which we may remove
three breakpoints (Figure 5) and the other one can be seen in Figure 6 (a). Note that this
cycle cannot be a component. This is, however, not true for the components in Figure 6
(b) and (c), which are the two smallest strongly unoriented components of odd length.




Fig. 6. A cycle of length three which cannot be removed by a transposition (a), the
smallest strongly unoriented component of odd length (b) and the second smallest
strongly unoriented component of odd length (c).
4 The (7/6)-Approximation and the $(1+\varepsilon)$-Approximation

Even though we at this stage are unable to recognize all strongly unoriented components
in an efficient manner, we are still able to approximate the distance reasonably well.
We will first show that our identification of the two strongly unoriented components of
size less than 6 that contain odd cycles exclusively will give us a 7/6-approximation
(remember that the size of a component $t$ was defined as $b(t) - c(t)$). We will then
show that if we have identified all odd strongly unoriented components of size less than
$k$, we can make a $(1+\varepsilon)$-approximation for $\varepsilon = 1/k$.

First we look at the case when we know for sure that tt is not a fortress, and then we 
look at the case when tt may be a fortress. 

4.1 If $\pi$ Is not a Fortress, We Have a 7/6-Approximation

For all odd unoriented components with size less than 6, we are able to distinguish
between those that are strongly unoriented and those that are not. In fact, the only strongly
unoriented components in this set can be found in Figure 6 (b) and (c). Thus, the
smallest components that may be wrongly deemed as strong hurdles are those of size 6.

Let $h_u(\pi)$, $c_u(\pi)$ and $b_u(\pi)$ be the number of components, the number of cycles and
the number of breakpoints among the odd unoriented components of size 6 or larger,
respectively. It is clear that $h_u(\pi) \le c_u(\pi)$ and $h_u(\pi) \le (b_u(\pi) - c_u(\pi))/6$. Let $b_o(\pi)$ and
$c_o(\pi)$ denote the number of breakpoints and cycles, respectively, among all other
components (that is, the components for which we know whether they are strongly unoriented or
not). Also, let $h_{none}(\pi)$ denote the number of hurdles we would have if none of the
large odd unoriented components were strongly unoriented and let $h_{all}(\pi)$ denote the
number of hurdles we would have if all of these were strongly unoriented. It follows that
$h_{none}(\pi) \le h_t(\pi) \le h_{all}(\pi) \le h_{none}(\pi) + h_u(\pi)$. This gives

\begin{align*}
d(\pi) &= b(\pi) - c(\pi) + h_t(\pi) \\
       &\le b_o(\pi) + b_u(\pi) - c_o(\pi) - c_u(\pi) + h_{all}(\pi) \\
       &\le b_o(\pi) - c_o(\pi) + b_u(\pi) - c_u(\pi) + h_{none}(\pi) + h_u(\pi) \\
       &\le b_o(\pi) - c_o(\pi) + b_u(\pi) - c_u(\pi) + h_{none}(\pi) + \frac{b_u(\pi) - c_u(\pi)}{6}
\end{align*}

and

\begin{align*}
d(\pi) &= b(\pi) - c(\pi) + h_t(\pi) \\
       &\ge b_o(\pi) + b_u(\pi) - c_o(\pi) - c_u(\pi) + h_{none}(\pi),
\end{align*}

and hence (putting $d_o(\pi) = b_o(\pi) - c_o(\pi) + h_{none}(\pi)$)

\[ d(\pi) \in \left[ d_o(\pi) + b_u(\pi) - c_u(\pi),\; d_o(\pi) + \frac{7\,(b_u(\pi) - c_u(\pi))}{6} \right]. \]

In most situations, $d_o(\pi)$ will be quite large compared to $b_u(\pi) - c_u(\pi)$ and then
the approximation is much better than 7/6. Thus, in practice we may use this algorithm
to get a reliable value for $d(\pi)$.
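
To see how tight the interval is for concrete numbers, the bounds can be evaluated directly. The sketch below assumes that $d_o(\pi)$ and the total size $b_u(\pi) - c_u(\pi)$ of the undecided components are already known. The function name `distance_bounds` and the parameter `k` (the smallest size of an undecided component, 6 in this subsection) are hypothetical.

```python
from fractions import Fraction


def distance_bounds(d_o, size_u, k=6):
    """Interval for d(pi) when pi is not a fortress.

    d_o    -- b_o - c_o + h_none, the part of the distance that is known exactly
    size_u -- b_u - c_u, total size of the undecided odd unoriented components
    k      -- smallest possible size of an undecided component
    """
    lower = Fraction(d_o + size_u)
    upper = d_o + Fraction((k + 1) * size_u, k)
    return lower, upper


low, high = distance_bounds(30, 12)
print(low, high, high / low)
```

With $d_o = 30$ and $b_u - c_u = 12$ the interval is [42, 44], a ratio of 22/21, and the ratio reaches 7/6 only when $d_o = 0$.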
4.2 If $\pi$ May Be a Fortress, We Still Have a 7/6-Approximation

The analysis is similar to the one in the previous case. To simplify things a bit, we
look at the worst case. The effect of $\pi$ being a fortress is most significant if $d(\pi)$ is
small. We need an odd number of strong super hurdles, and no other strong hurdles,
to make a fortress. It takes two strong hurdles to form a strong super hurdle, and one
strong super hurdle cannot exist by itself. Thus, we need at least six strongly unoriented
components, arranged in pairs, covering disjoint intervals of the circle.

We consider the case where we have six components, arranged such that we have
three possible strong super hurdles. For each of these three pairs, there are three possible
cases. If we know that we have a strong super hurdle, then we know that, for each
component, $b - c \ge 2$ (there are no components with $b - c < 2$). Thus, for the pair we
have $b - c + h \ge 5$. If we know that one of the two components is strongly unoriented,
but we are not sure about the other, then we know that we have a strong hurdle and for
the second component, we know that $b - c \ge 6$. Together this gives $b - c + h \ge 9$. Finally,
if we do not know whether either of the two components is strongly unoriented, we
do not even know whether the pair constitutes a strong hurdle. Since both components
fulfill $b - c \ge 6$, we get $b - c + h \in [r, r+1]$, where $r \ge 12$. The worst case is when
we are totally ignorant in each of the three pairs and $r = 12$. In that case, we get

\[ d(\pi) \in [3 \cdot 12,\; 3 \cdot 13 + 1] = [36, 40], \]

and since this is the worst case, we have a (10/9)-approximation, which is better than
7/6. Again, this ratio will be significantly smaller in most applications.



4.3 The $(1+\varepsilon)$-Approximation

In order to improve on the (7/6)-approximation, we need to be able to identify strong
hurdles among larger components. Since we have not yet found an easy way to do this,
we content ourselves with creating a table of all unoriented components of a certain
size, which are not strongly unoriented. The table could be created using, for instance,
an exhaustive search.

Given a table of all such components of size less than $k$, and a component $t$ of size
less than $k$, we will be able to tell if $t$ is strongly unoriented or not. Thus, applying the
same calculations as in the 7/6 case above, we find that (if $\pi$ is not a fortress)

\[ d(\pi) \in \left[ d_o(\pi) + b_u(\pi) - c_u(\pi),\; d_o(\pi) + \frac{(k+1)\,(b_u(\pi) - c_u(\pi))}{k} \right] \]

or (if $\pi$ may be a fortress, worst case (for $k > 10$, the worst case is different from the
worst case for $k = 6$))

\[ d(\pi) \in [2k + 2 \cdot 5,\; 2k + 1 + 2 \cdot 5 + 1]. \]

We clearly have

\[ \lim_{k \to \infty} \frac{k+1}{k} = \lim_{k \to \infty} \left( 1 + \frac{1}{k} \right) = 1 \]

and

\[ \lim_{k \to \infty} \frac{2k+12}{2k+10} = \lim_{k \to \infty} \left( 1 + \frac{1}{k+5} \right) = 1. \]
5 The Algorithm

We will now describe the (7/6)-approximation algorithm, which is easily generalized
to the $(1+\varepsilon)$ case. First remove, by applying a sequence of optimal transpositions, all
odd unoriented components of size less than six that are not strongly unoriented. This
can be done as follows: Find a black edge such that its two adjacent grey edges are
crossing. For these small components, this can always be done. Cut this black edge and
the two black edges that are adjacent to the mentioned grey edges. This transposition
will always reduce the distance by two, and for these components, we can always
continue afterwards in a similar fashion. In the $(1+\varepsilon)$ case, we would have to use the
table to find out which transposition to use.

After removing these components, we can apply the inversion algorithm of
Hannenhalli and Pevzner. The complexity of this algorithm is polynomial in the length of
the original permutation, as is the first step of our algorithm, since identifying the
unoriented components that are not strongly unoriented and removing them can be done
in linear time. Computing the inversion distance can be done in linear time and thus
an approximation of the combined distance can also be computed in linear time.

To get a $(1+\varepsilon)$-approximation, all we have to do is to tabulate all odd unoriented
components of size less than $1/\varepsilon$ that are not strongly unoriented. We also need to tabulate, for
each such component, a sequence of transpositions that will remove the component
efficiently. It is clear that the algorithm is still polynomial, since looking up a component
in the table is done in constant time (for each $\varepsilon$).



6 Discussion 

The algorithm presented here relies on the creation of a table of components that can be 
removed efficiently. Could this technique be used to find an algorithm for any similar 
sorting problem such as sorting by transpositions? In general, the answer is no. In this 
case, as for sorting with inversions, we know that if a component can not be removed 
efficiently, we need only one extra inversion. We also know that for components that 
can be removed efficiently, we can never improve on such a sorting by combining com- 
ponents. For sorting by transpositions, no such results are known and until they are, the 
table will need to include not only some of the components up to a certain size, but 
every permutation of every size. 

The next step is obviously to examine if there is an easy way to distinguish all
strongly unoriented components. For odd unoriented components, this property seems
very elusive. It also seems hard to discover a useful sequence of transpositions that
removes odd unoriented components that are not strongly unoriented. However,
investigations on small components have given very promising results. For cycles of length 7,
we have the following result: If the cycle is not a strongly unoriented component, then
no transposition that increases the number of cycles by two will give a strongly
unoriented component. This appears to be the case for cycles of length 9 as well, but no fully
exhaustive search has been conducted, due to limited computational resources.

If this pattern holds, we could apply any sequence of breakpoint removing
transpositions to a component, until we either have removed the component, or are
unable to find any useful transpositions. In the first case, the component is clearly not
strongly unoriented, and in the second case it would be strongly unoriented.
Abstract. In this paper, we study the Reversal Median Problem (RMP), which
arises in computational biology and is a basic model for the reconstruction of
evolutionary trees. Given $q$ genomes, RMP calls for another genome such that
the sum of the reversal distances between this genome and the given ones is
minimized. So far, the problem was considered too complex to derive mathematical
models useful for its practical solution. We use the graph theoretic relaxation of
RMP that we developed in a previous paper, essentially calling for a perfect
matching in a graph that forms the maximum number of cycles jointly with $q$
given perfect matchings, to design effective algorithms for its exact and
heuristic solution. We report the solution of a few hundred instances associated with
real-world genomes.



1 Introduction 

The problem of reconstructing evolutionary trees from genome sequences is of great
interest in computational biology. Formally, given a set of genomes, each
representing a species and defined by a sequence of genes, and a notion of distance between
genome pairs, the problem calls for a Steiner tree in the genome space, with the given
genomes defining the Steiner node set (the other nodes in the tree are intended to
represent species from which the given species have evolved). Distance is aimed at modeling
the number of evolutionary changes (also called elementary operations) that led from
one genome to another. There are several types of elementary operations to be
considered, such as reversals (also called inversions), transpositions, translocations,
deletions, insertions, etc. Defining suitable notions of distance and finding algorithms
to compute distances was a challenging topic in the 90s. In particular, much work
has been done in the simplified model in which all genomes contain the same set of
genes, all genes appearing within a genome are pairwise different (i.e., there are no
gene duplications) and elementary operations are of a unique type. For this case, the
most realistic notion of distance appears to be the reversal (or inversion) distance,
corresponding to the minimum number of reversals that transform one genome into another.

A breakthrough result in computational biology states that, if the orientation of the
genes within the genomes is known, the reversal distance can be computed in
polynomial time. Actually, the time is even linear in the genome size.
Therefore, being more realistic than other distances and efficient to compute,
one would expect the reversal distance to be employed when evolutionary trees are
reconstructed (still under the simplifying assumptions of genomes with the same genes,
no gene duplication and elementary operations of a unique type). This is not the case,
probably because the mathematical modeling of the problem of finding the “best” tree
w.r.t. the reversal distance is considered to be too complex, giving as motivation the fact
that, though efficient, computation of the distance between two permutations is not at
all trivial (actually its complexity was open for a while). Only a few
attempts have so far been made to use the reversal distance within evolutionary tree
reconstruction.

For the reasons above, the use of a simpler (though less realistic) notion of distance,
called breakpoint distance, which is trivial to compute, has been proposed and
extensively studied afterwards. In particular, much work has been
done on the so-called Breakpoint Median Problem (BMP), which is the problem of
finding a genome which is closest to a given set of genomes w.r.t. the breakpoint distance,
i.e., the sum of the breakpoint distances between the genome to be found and each given
genome is minimized. All these methods to reconstruct evolutionary trees use
as a subroutine a procedure to solve BMP, either exactly or heuristically, in order to find
the “best” genome associated with a given tree node once the genomes associated with
the neighbors of the node are fixed. It is easy to show that BMP is a special case of
the Traveling Salesman Problem (TSP).

This paper is aimed at considering the use of the reversal distance in the
reconstruction of evolutionary trees, mainly focusing on the Reversal Median Problem (RMP),
the counterpart of BMP in which the reversal distance is used instead of the breakpoint
distance. In a previous paper we described a graph theoretic relaxation of RMP which allowed us
to prove that the problem is NP-hard (actually also APX-hard). Essentially all
papers dealing with BMP mention RMP as a more realistic model, say that it is NP-hard
citing that result, and motivate the use of BMP by sentences like “For RMP, there are no
algorithms available, aside from rough heuristics, for handling even three relatively short
genomes” or “Even heuristic approaches for RMP work well only for small
instances”. Our ultimate goal is to show that, although RMP is NP-hard and
nontrivial to model, instances of the problem associated with real-world genomes can
be solved to (near-)optimality within short computing time.

In this paper we first recall the graph theoretic relaxation of RMP given in our
previous paper. There we also presented an Integer Linear Programming (ILP) formulation
for this relaxation, hoping that this ILP would allow us to solve real-world RMP
instances to proven optimality. Actually, this was not the case, as we discuss in
the present paper: the exact algorithm that we propose
is not based on this ILP. However, a careful use of the LP relaxation of the ILP is the
key part of the best heuristic algorithm that we have developed.
Experimental results show that our methods can
solve to proven optimality many instances associated with real-world genomes, even of
relatively large size, and find provably good solutions for the remaining instances.



2 Preliminaries 

A genome without gene duplications for which the orientation of the genes is known can
be represented by a signed permutation $\pi$ on $N := \{1, \ldots, n\}$, obtained by signing a
permutation $\tau = (\tau_1 \ldots \tau_n)$ on $N$, i.e., by replacing each element $\tau_i$ by either $\pi_i = +\tau_i$
or $\pi_i = -\tau_i$. In particular, signs model the relative orientation of the genes within the
genome. We denote by $\Sigma_n$ the set of the $2^n n!$ signed permutations on $N$. A reversal of
the interval $(i, j)$, $1 \le i \le j \le n$, applied to a signed permutation $\pi$, is an operation
which both inverts the subsequence $\pi_i\, \pi_{i+1} \ldots \pi_{j-1}\, \pi_j$ and switches the signs of the
elements in the subsequence, replacing $\pi_1 \ldots \pi_{i-1}\, \pi_i\, \pi_{i+1} \ldots \pi_{j-1}\, \pi_j\, \pi_{j+1} \ldots \pi_n$
by $\pi_1 \ldots \pi_{i-1}\, {-\pi_j}\, {-\pi_{j-1}} \ldots {-\pi_{i+1}}\, {-\pi_i}\, \pi_{j+1} \ldots \pi_n$. The minimum number
of reversals needed to transform a signed permutation $\pi^1$ into a signed permutation
$\pi^2$ (or vice versa) is called the reversal distance between $\pi^1$ and $\pi^2$, here denoted by
$d(\pi^1, \pi^2)$. Given two signed permutations, the problem of Sorting By Reversals (SBR)
calls for finding the reversal distance between the two permutations and an associated
shortest sequence of reversals. SBR has been shown to be polynomially solvable.
At present, the best asymptotic running time is $O(n^2)$.
Nevertheless, as mentioned above, the simple computation of the reversal
distance (without an associated sequence of reversals) can be carried out in $O(n)$ time.
Throughout the paper we will only work with signed permutations, therefore for
convenience we use the term permutation to indicate a signed permutation.

Given $q$ permutations $\pi^1, \ldots, \pi^q \in \Sigma_n$, $q \ge 3$, representing genomes with the same
set of genes, RMP calls for a permutation $\sigma \in \Sigma_n$ such that $\delta(\sigma) := \sum_{k=1}^{q} d(\sigma, \pi^k)$ is
minimized. The remainder of the section gives the essential notions from the previous
works on SBR and RMP.
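
For tiny instances, both the reversal distance and the median objective can be evaluated by brute force, which makes the definitions concrete. The sketch below is not one of the algorithms of this paper: it simply runs a breadth-first search over all $2^n n!$ signed permutations from each given genome and picks a permutation minimizing the sum of distances. The function names are hypothetical, and the approach is only feasible for very small $n$.

```python
from collections import deque


def reversals(p):
    """All permutations obtained from p by one reversal of an interval (i, j)."""
    n = len(p)
    for i in range(n):
        for j in range(i, n):
            yield p[:i] + tuple(-x for x in reversed(p[i:j + 1])) + p[j + 1:]


def distances_from(source):
    """Reversal distance from `source` to every signed permutation (BFS)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        p = queue.popleft()
        for q in reversals(p):
            if q not in dist:
                dist[q] = dist[p] + 1
                queue.append(q)
    return dist


def reversal_median(genomes):
    """Return (sigma, delta(sigma)) minimizing the sum of reversal distances."""
    tables = [distances_from(tuple(g)) for g in genomes]

    def delta(sigma):
        return sum(table[sigma] for table in tables)

    best = min(tables[0], key=delta)
    return best, delta(best)


genomes = [(2, -3, 1), (3, -1, -2), (1, -2, 3)]
print(reversal_median(genomes))
```

Since a reversal is its own inverse, the distance is symmetric, so one search per input genome suffices to evaluate $\delta(\sigma)$ for every candidate $\sigma$.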

The graphs we consider are in general multigraphs, possibly having parallel edges
with common endpoints. More generally, we consider multisets, which may contain
distinct copies of identical elements. Given a node set $V$ we call an edge set
$M \subseteq \{(i, j) : i, j \in V, i \neq j\}$ a matching of $V$ if each node in $V$ is incident to at most
one edge in $M$. If each node in $V$ is incident to exactly one edge in $M$, the matching is
called perfect. In general, we work with a graph $G = (V, E)$ and an associated (perfect)
matching $M$ of $V$ without requiring $M \subseteq E$. Cycles and paths are thought of as subsets
of $\{(i, j) : i, j \in V, i \neq j\}$, by implicitly assuming that they are simple, i.e., they do
not visit a same node twice or more. The length of a cycle or path is given by the
number of its edges, and we consider “degenerate” cycles of length 2 formed by pairs
of parallel edges. A Hamiltonian cycle of $V$ is a cycle visiting all the nodes in $V$. Given
two matchings $M$, $L$ of $V$, the graph $(V, M \cup L)$ corresponds to a set of node-disjoint
paths and cycles; we call the latter the cycles defined by $M \cup L$. If both $M$ and $L$ are
perfect, the graph $(V, M \cup L)$ corresponds to a set of node-disjoint cycles visiting all
the nodes in $V$.

Consider the node set $V := \{0, 1, \ldots, 2n, 2n+1\}$, and the associated perfect matching
$H := \{(2i-1, 2i) : i = 1, \ldots, n\} \cup \{(0, 2n+1)\}$, called the base matching of
$V$. For convenience, let $\tilde{n} := n + 1$ denote the cardinality of any perfect matching of
$V$. There is a natural correspondence between signed permutations in $\Sigma_n$ and perfect
matchings $M$ of $V$ such that $M \cup H$ defines a Hamiltonian cycle of $V$. These matchings
are called permutation matchings. In particular, the permutation matching $M(\pi)$
associated with a permutation $\pi \in \Sigma_n$ is defined by

$$M(\pi) := \{(2|\pi_i| - \nu(\pi_i),\; 2|\pi_{i+1}| - 1 + \nu(\pi_{i+1})) : i \in \{0, \ldots, n\}\}, \qquad (1)$$

where for convenience we have defined $\pi_0 := 0$, $\pi_{n+1} := n + 1$ and $\nu(\pi_i) := 0$ if
$\pi_i \ge 0$, $\nu(\pi_i) := 1$ if $\pi_i < 0$. Figure 1 illustrates the permutation matchings associated
with $\pi^1 = (2\ {-3}\ 1)$, $\pi^2 = (3\ {-1}\ {-2})$ and $\pi^3 = (1\ {-2}\ 3)$.

Fig. 1. The MB Graph $G(\pi^1, \pi^2, \pi^3)$ Associated with $\pi^1 = (2\ {-3}\ 1)$, $\pi^2 = (3\ {-1}\ {-2})$
and $\pi^3 = (1\ {-2}\ 3)$.

The following result is well known and used in almost all papers on SBR.

Proposition 1. $M(\cdot)$ defines a one-to-one mapping between signed permutations
in $\Sigma_n$ and permutation matchings of $V$.
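
Formula (1) can be checked on small cases with a short script. The sketch below builds $M(\pi)$ and the base matching $H$, and tests whether their union is a single cycle through all $2n+2$ nodes. The function names and the representation of edges as `frozenset` pairs are assumptions of this example, not part of the paper.

```python
def permutation_matching(perm):
    """Build M(pi) from formula (1) as a set of frozenset edges."""
    n = len(perm)
    ext = [0] + list(perm) + [n + 1]          # pi_0 = 0, pi_{n+1} = n + 1
    nu = lambda x: 1 if x < 0 else 0          # sign indicator nu
    return {
        frozenset((2 * abs(ext[i]) - nu(ext[i]),
                   2 * abs(ext[i + 1]) - 1 + nu(ext[i + 1])))
        for i in range(n + 1)
    }


def base_matching(n):
    """Base matching H: edges (2i-1, 2i) for i = 1..n, plus (0, 2n+1)."""
    edges = {frozenset((2 * i - 1, 2 * i)) for i in range(1, n + 1)}
    return edges | {frozenset((0, 2 * n + 1))}


def partners(matching):
    """Map every node to its mate in a perfect matching."""
    p = {}
    for u, v in map(tuple, matching):
        p[u], p[v] = v, u
    return p


def is_hamiltonian(m, h, n):
    """True if the union of perfect matchings m and h is one cycle on 2n+2 nodes."""
    pm, ph = partners(m), partners(h)
    node, steps = 0, 0
    while True:
        node = ph[pm[node]]                   # one edge of m, then one edge of h
        steps += 2
        if node == 0:
            return steps == 2 * n + 2


m1 = permutation_matching([2, -3, 1])
print(sorted(tuple(sorted(e)) for e in m1))   # [(0, 3), (1, 5), (2, 7), (4, 6)]
print(is_hamiltonian(m1, base_matching(3), 3))  # True
```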

Given two permutations $\pi^1$ and $\pi^2$, the set of edges $M(\pi^1) \cup M(\pi^2)$ defines a set
of cycles whose edges are alternately in $M(\pi^1)$ and in $M(\pi^2)$ (possibly containing
cycles of length 2). Let $c(\pi^1, \pi^2)$ denote the number of these cycles. For permutations
$\pi^1, \pi^2, \pi^3$ in Figure 1, $d(\pi^1, \pi^3) = 2$, $c(\pi^1, \pi^3) = 2$, $d(\pi^1, \pi^2) = d(\pi^2, \pi^3) = 3$,
$c(\pi^1, \pi^2) = c(\pi^2, \pi^3) = 1$. A fundamental result proved in [?], restated according
to our notation, is the following.

Proposition 2. ([?]) For two permutations $\pi^1, \pi^2 \in \Sigma_n$, $d(\pi^1, \pi^2) \ge \tilde{n} - c(\pi^1, \pi^2)$.

We will call $\tilde{n} - c(\pi^1, \pi^2)$ the cycle distance between $\pi^1$ and $\pi^2$ or between $M(\pi^1)$ and
$M(\pi^2)$. In practice, the lower bound on the reversal distance given by the cycle distance
turns out to be very strong. It is convenient to extend the notion of cycle distance also
to general perfect matchings. Given two perfect matchings $T, S$ of $V$ (not necessarily
permutation matchings), let $c(T, S)$ denote the number of cycles defined by $T \cup S$. The
cycle distance between $T$ and $S$ is given by $\tilde{n} - c(T, S)$. It is easy to verify that this is
indeed a distance, for it satisfies the triangle inequality.
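
The cycle count $c(T, S)$ and the cycle distance can be computed by walking alternately along the edges of the two matchings. The following sketch is one straightforward way to do this; the function names and the list-of-pairs representation are assumptions of this example. The three matchings in the example are the ones obtained from formula (1) for the permutations of Figure 1.

```python
def partners(matching):
    """Map each node to its mate in a perfect matching given as (u, v) pairs."""
    p = {}
    for u, v in matching:
        p[u], p[v] = v, u
    return p


def count_cycles(t, s):
    """Number of cycles defined by the union of perfect matchings t and s.

    Two parallel copies of the same edge count as one cycle of length 2.
    """
    pt, ps = partners(t), partners(s)
    seen, cycles = set(), 0
    for start in pt:
        if start in seen:
            continue
        cycles += 1
        node = start
        while node not in seen:
            seen.add(node)
            mate = pt[node]          # follow the edge of t ...
            seen.add(mate)
            node = ps[mate]          # ... then the edge of s
    return cycles


def cycle_distance(t, s):
    """Cycle distance: size of a perfect matching minus the number of cycles."""
    return len(t) - count_cycles(t, s)


M1 = [(0, 3), (4, 6), (5, 1), (2, 7)]   # pi^1 = (2 -3 1)
M2 = [(0, 5), (6, 2), (1, 4), (3, 7)]   # pi^2 = (3 -1 -2)
M3 = [(0, 1), (2, 4), (3, 5), (6, 7)]   # pi^3 = (1 -2 3)
print(count_cycles(M1, M2), count_cycles(M1, M3), count_cycles(M2, M3))  # 1 2 1
print(cycle_distance(M1, M1))  # 0
```

Each node is visited once, so one call runs in time linear in the number of nodes.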

Let $Q := \{1, \ldots, q\}$. We define the MB graph associated with permutations $\pi^1, \ldots,
\pi^q \in \Sigma_n$ as the graph $G(\pi^1, \ldots, \pi^q)$ with node set $V$ and edge multiset $M(\pi^1) \cup
\ldots \cup M(\pi^q)$. Note that possible common edges in $M(\pi^j)$ and $M(\pi^k)$, $j \ne k$, are
considered distinct parallel edges in $G(\pi^1, \ldots, \pi^q)$. Figure 1 represents the MB graph
$G(\pi^1, \pi^2, \pi^3)$ associated with $\pi^1 = (2\ {-3}\ 1)$, $\pi^2 = (3\ {-1}\ {-2})$ and $\pi^3 = (1\ {-2}\ 3)$.

For a given RMP instance defined by permutations $\pi^1, \ldots, \pi^q \in \Sigma_n$, and a permutation
$\sigma \in \Sigma_n$, let $\gamma(\sigma) := \sum_{k=1}^{q} c(\sigma, \pi^k)$. Proposition 2 immediately implies that
$\delta(\sigma) \ge q\tilde{n} - \gamma(\sigma)$. Together with Proposition 1, this suggests to study
the following Cycle Median Problem (CMP): Given permutations $\pi^1, \ldots, \pi^q \in \Sigma_n$,
find a permutation $\tau \in \Sigma_n$ such that $q\tilde{n} - \gamma(\tau)$ is minimized. In terms of $G(\pi^1, \ldots, \pi^q)$,
the problem calls for a permutation matching $T$ such that $q\tilde{n} - \sum_{k=1}^{q} c(T, M(\pi^k))$ is
minimized. The correspondence between solutions $\tau$ and $T$ is given by $M(\tau) = T$. The
immediate generalization of Proposition 2 for RMP is the following. Let $\delta^*$ and $q\tilde{n} - \gamma^*$
denote the optimal solution values of RMP and CMP, respectively.

Fig. 2. The Contraction of Edge $(i, j)$.

Proposition 3. ([?]) Given an RMP instance and the associated CMP instance, $\delta^* \ge
q\tilde{n} - \gamma^*$.

For the MB graph in Figure 1 an optimal solution of both RMP and CMP is given
by $\pi^1$, for which $\delta(\pi^1) = d(\pi^1, \pi^1) + d(\pi^1, \pi^2) + d(\pi^1, \pi^3) = 0 + 3 + 2 = 5$ and
$\gamma(\pi^1) = c(\pi^1, \pi^1) + c(\pi^1, \pi^2) + c(\pi^1, \pi^3) = 4 + 1 + 2 = 7$. Hence, $\delta^* = 5$ and
$\gamma^* = 7$.

The strength of the cycle distance lower bound on the reversal distance suggests that 
the solution of CMP should yield a strong lower bound on the optimal solution value 
of RMP. CMP is indeed the key problem addressed in this paper to derive effective 
algorithms for RMP. 

We conclude this section with an important notion that will be used in the next 
sections. Given a perfect matching $M$ on node set $V$ and an edge $e = (i, j) \in \{(i,j) :
i,j \in V, i \ne j\}$, we let $M/e$ be defined as follows. If $e \in M$, $M/e := M \setminus \{e\}$.
Otherwise, letting $(i,a)$, $(j,b)$ be the two edges in $M$ incident to $i$ and $j$, $M/e :=
M \setminus \{(i,a), (j,b)\} \cup \{(a,b)\}$. The following obvious lemma will be used later in the
paper.

Lemma 1. Given two perfect matchings $M, L$ of $V$ and an edge $e = (i,j) \in M$, $M \cup L$
defines a Hamiltonian cycle of $V$ if and only if $(M/e) \cup (L/e)$ defines a Hamiltonian
cycle of $V \setminus \{i,j\}$.

Given an MB graph $G(\pi^1, \ldots, \pi^q)$, the contraction of an edge $e = (i,j) \in \{(i,j) :
i,j \in V, i \ne j\}$ is the operation that modifies $G(\pi^1, \ldots, \pi^q)$ as follows. Edge $(i,j)$ is
removed along with nodes $i$ and $j$. For $k = 1, \ldots, q$, $M(\pi^k)$ is replaced by $M(\pi^k)/e$,
and the base matching $H$ is replaced by $H/e$. Figure 2 illustrates the contraction of
edge $(i, j)$.
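
The contraction $M/e$ is simple to express in code. The sketch below follows the two cases of the definition; the names `contract` and `as_set` and the `frozenset` edge representation are assumptions of this example.

```python
def as_set(pairs):
    """Turn a list of (u, v) pairs into a set of frozenset edges."""
    return {frozenset(p) for p in pairs}


def contract(matching, e):
    """Return M/e for a perfect matching M (set of frozenset edges) and e = (i, j)."""
    i, j = e
    edge = frozenset(e)
    if edge in matching:
        return matching - {edge}                 # e is in M: just drop it
    mate = {}
    for u, v in map(tuple, matching):
        mate[u], mate[v] = v, u
    a, b = mate[i], mate[j]                      # (i, a) and (j, b) are in M
    return (matching - {frozenset((i, a)), frozenset((j, b))}) | {frozenset((a, b))}


# Running example with n = 3: contract edge (0, 3), which belongs to M(pi^1).
M1 = as_set([(0, 3), (4, 6), (5, 1), (2, 7)])
M2 = as_set([(0, 5), (6, 2), (1, 4), (3, 7)])
H = as_set([(1, 2), (3, 4), (5, 6), (0, 7)])
print(contract(M1, (0, 3)) == as_set([(4, 6), (5, 1), (2, 7)]))   # True
print(contract(M2, (0, 3)) == as_set([(6, 2), (1, 4), (5, 7)]))   # True
print(contract(H, (0, 3)) == as_set([(1, 2), (5, 6), (4, 7)]))    # True
```

In this example the contracted $M(\pi^1)$ and the contracted $H$ again form a single cycle on the six remaining nodes, as Lemma 1 states.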

Note that the graph obtained after an edge contraction is not necessarily an MB 
graph, in particular there may be some $k \in Q$ such that $M(\pi^k) \cup H$ defines more than
one cycle after contraction. This is apparently a drawback, since our methods deal with
instances obtained by contracting edges. To overcome this, our algorithms are suited for
the following generalization of CMP. Consider a graph $G$ on node set $V$, with $|V| = 2\tilde{n}$
for some integer $\tilde{n} \ge 1$, along with a perfect matching $H$, called base matching, and
let $E := \{(i,j) : i,j \in V, i \ne j\} \setminus H$. Given $q$ perfect matchings $M_1, \ldots, M_q \subseteq E$
(which do not necessarily define Hamiltonian cycles with $H$), find a perfect matching
$T \subseteq E$ such that $T \cup H$ is a Hamiltonian cycle and $q\tilde{n} - \sum_{k=1}^{q} c(T, M_k)$ is minimized.

In the rest of the paper, we will use the term CMP to denote this more general version. 

3 A Combinatorial Branch-and-Bound Algorithm 

We next describe a simple branch-and-bound algorithm for CMP (and RMP) based on 
a straightforward lower bound (as opposed to the sophisticated LP relaxation of the next
section) that can be computed in time linear in $n$. The key issue of our approach is that
branching corresponds to edge contraction, and yields another problem with the same
structure for which the above lower bound can be computed as well. We describe the
method for CMP and then mention the slight modification required to solve RMP.

We start by illustrating the combinatorial lower bound, which is only based on the
property that the cycle distance satisfies the triangle inequality and is well known for
other median problems in metric spaces (see e.g. [?]). As in Section 2, let $q\tilde{n} - \gamma^*$
denote the optimal solution value of CMP.

Lemma 2. Given a CMP instance associated with matchings $M_1, \ldots, M_q$,

$$\frac{q\tilde{n}}{2} + \frac{1}{q-1} \sum_{k=1}^{q-1} \sum_{l=k+1}^{q} c(M_k, M_l) \ge \gamma^*. \qquad (2)$$

The lower bound on the optimal CMP (and RMP) value given by $q\tilde{n}$ minus the left-hand
side of (2), called $lb_c$, can be computed in $O(nq^2)$ time since computation of
$c(M_k, M_l)$ for each pair $k, l = 1, \ldots, q$, $k \ne l$, takes $O(n)$ time. In the next section we give
an LP interpretation of this bound.
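
The combinatorial bound needs only the pairwise cycle counts between the input matchings. The sketch below computes $q\tilde{n}$ minus the bound on $\gamma^*$ obtained by summing the triangle inequality over all pairs of input matchings. Rounding the result up to an integer is an assumption of this example (it is valid because the CMP objective is integral); the function names are likewise hypothetical.

```python
from itertools import combinations


def partners(matching):
    """Map each node to its mate in a perfect matching given as (u, v) pairs."""
    p = {}
    for u, v in matching:
        p[u], p[v] = v, u
    return p


def count_cycles(t, s):
    """Number of cycles defined by the union of perfect matchings t and s."""
    pt, ps = partners(t), partners(s)
    seen, cycles = set(), 0
    for start in pt:
        if start not in seen:
            cycles += 1
            node = start
            while node not in seen:
                seen.update((node, pt[node]))
                node = ps[pt[node]]
    return cycles


def combinatorial_lower_bound(matchings):
    """q*n~ minus [q*n~/2 + (sum of pairwise cycle counts)/(q-1)], rounded up."""
    q = len(matchings)
    n_tilde = len(matchings[0])
    pair_cycles = sum(count_cycles(a, b) for a, b in combinations(matchings, 2))
    # Exact integer arithmetic: ceil((q*n~*(q-1) - 2*pair_cycles) / (2*(q-1))).
    num = q * n_tilde * (q - 1) - 2 * pair_cycles
    den = 2 * (q - 1)
    return -(-num // den)


M1 = [(0, 3), (4, 6), (5, 1), (2, 7)]   # pi^1 = (2 -3 1)
M2 = [(0, 5), (6, 2), (1, 4), (3, 7)]   # pi^2 = (3 -1 -2)
M3 = [(0, 1), (2, 4), (3, 5), (6, 7)]   # pi^3 = (1 -2 3)
print(combinatorial_lower_bound([M1, M2, M3]))  # 4
```

For the instance of Figure 1 the bound is 4, against the optimal CMP value of $12 - 7 = 5$ reported in Section 2.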

Recall the definition of edge contraction in Section 2. The branching scheme of our
algorithm is inspired by the following

Lemma 3. Given a CMP instance and an edge $e \in E$, the best CMP solution containing
$e$ is given by $\tilde{T} \cup \{e\}$, where $\tilde{T}$ is the optimal solution of the CMP instance obtained
by contracting edge $e$.

Proof. Let $T := \tilde{T} \cup \{e\}$. First of all, note that $T$ is a feasible CMP solution by Lemma
1. Now suppose there is another solution $T^*$ with $e \in T^*$ and

$$\sum_{k \in Q} c(T^*, M_k) > \sum_{k \in Q} c(T, M_k). \qquad (3)$$

For all $k$ such that $e \in M_k$, a cycle of length 2 formed by two copies of $e$ is defined by
both $T^* \cup M_k$ and $T \cup M_k$. Furthermore, contraction of edge $e$ ensures a one-to-one
correspondence between cycles defined by $T \cup M_k$ (resp. $T^* \cup M_k$) which are not two
copies of $e$ and cycles in $\tilde{T} \cup M_k$ (resp. $(T^* \setminus \{e\}) \cup M_k$) after the contraction of $e$. Hence,
(3) would contradict the optimality of $\tilde{T}$. □

According to Lemma 3, if we fix edge $e$ in the CMP solution, an upper bound on the
number of cycles is given by $|\{k : M_k \ni e\}|$ (i.e., the number of cycles of length 2
defined by two copies of $e$) plus the upper bound (2) computed after the contraction
of $e$. This allows us to design a branch-and-bound algorithm where, starting from node
$0 \in V$, we enumerate all permutation matchings by fixing, in turn, either edge $(0, 1)$,
or $(0, 2)$, ..., or $(0, 2n)$ in the solution. Recursively, if the last edge fixed in the current
partial solution is $(i, j)$ and the edge in $H$ incident to $j$ is $(j, k)$, we proceed with the
enumeration by fixing in the solution, in turn, edge $(k, l)$, for all $l$ with no incident
edge fixed so far. We proceed in a depth-first way. With this scheme we can perform
the lower bound test after each fixing in $O(nq^2)$ time (in fact, the recomputation of
the lower bound after each fixing is done parametrically and, in practice, turns out to
be faster than from scratch). In order to have a fast processing of the subproblems,
the only operations performed are edge contraction and the bound computation, and
the incumbent solution is updated only when the current partial solution is a complete
solution.

The main drawback with the above scheme is that good solutions are found after
considerable computing time. To overcome this, we start from the lower bound $lb_c$
computed for the original problem and call the branch-and-bound first with a target
value $t := lb_c$, searching for a CMP solution of value $t$ and backtracking as soon as
the lower bound for the current partial solution is $> t$. If a solution of value $t$ is found,
it is optimal and we stop, otherwise no solution of value better than $lb_c + 1$ exists.
Accordingly, we call the branch-and-bound with target value $t := lb_c + 1$, and so on,
stopping as soon as we find a solution with the target value. Even if this has the effect
of reconsidering some subproblems more than once, every call of branch-and-bound
with a new value of $t$ typically takes much longer than the previous one, therefore
the increase in running time due to many calls is negligible. On the other hand, the
scheme allows for the fast derivation of good lower bounds, noting that $t$ is a lower
bound on the optimal CMP value and is increased by 1 after each call, with the minimal
core memory requirements of a depth-first branch-and-bound (we often examine several
million subproblems so the explicit storage of the subproblems would be impossible).
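
The enumeration order and the increasing-target loop can be illustrated on tiny instances. The sketch below is deliberately simplified and is not the paper's implementation: it enumerates every permutation matching in the order described above (edges fixed from node 0 onwards) and restarts with the target increased by 1, but it omits edge contraction and the lower-bound test on partial solutions, so it is only usable for very small $n$. All function names are assumptions of this example, as is rounding the initial bound up to an integer.

```python
from itertools import combinations


def partners(matching):
    """Map each node to its mate in a perfect matching given as (u, v) pairs."""
    p = {}
    for u, v in matching:
        p[u], p[v] = v, u
    return p


def count_cycles(t, s):
    """Number of cycles defined by the union of perfect matchings t and s."""
    pt, ps = partners(t), partners(s)
    seen, cycles = set(), 0
    for start in pt:
        if start not in seen:
            cycles += 1
            node = start
            while node not in seen:
                seen.update((node, pt[node]))
                node = ps[pt[node]]
    return cycles


def cmp_value(t, matchings):
    """CMP objective: q*n~ minus the total number of cycles with the inputs."""
    return len(matchings) * len(t) - sum(count_cycles(t, m) for m in matchings)


def permutation_matchings(h):
    """Yield every T such that T u H is Hamiltonian, fixing edges from node 0."""
    hp = partners(h)
    last = hp[0]                       # H-mate of node 0: reached by the final edge

    def extend(k, used, edges):
        free = [v for v in hp if v not in used and v != last]
        if not free:
            yield edges + [(k, last)]
            return
        for l in free:                 # fix edge (k, l), continue from the H-mate of l
            yield from extend(hp[l], used | {l, hp[l]}, edges + [(k, l)])

    yield from extend(0, {0}, [])


def solve_by_increasing_target(h, matchings):
    """Search for a solution of value t, starting at the bound and raising t by 1."""
    q, n_tilde = len(matchings), len(h)
    pairs = sum(count_cycles(a, b) for a, b in combinations(matchings, 2))
    target = -(-(q * n_tilde * (q - 1) - 2 * pairs) // (2 * (q - 1)))
    while True:
        for t in permutation_matchings(h):
            if cmp_value(t, matchings) == target:
                return t, target       # first hit is optimal: smaller targets failed
        target += 1


H = [(1, 2), (3, 4), (5, 6), (0, 7)]
M1 = [(0, 3), (4, 6), (5, 1), (2, 7)]   # pi^1 = (2 -3 1)
M2 = [(0, 5), (6, 2), (1, 4), (3, 7)]   # pi^2 = (3 -1 -2)
M3 = [(0, 1), (2, 4), (3, 5), (6, 7)]   # pi^3 = (1 -2 3)
solution, value = solve_by_increasing_target(H, [M1, M2, M3])
print(solution, value)
```

Because every target below the returned value was searched exhaustively without success, the first solution found is optimal; a real implementation would prune partial solutions whose bound exceeds the target instead of enumerating all of them.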

The above algorithm can easily be modified to find optimal RMP solutions. In particular,
each time the current partial solution is a complete solution, of value (say) $q\tilde{n} - \gamma$
(w.r.t. the CMP objective function), the above algorithm tests if $q\tilde{n} - \gamma = t$. If this is
the case, the algorithm stops as the current solution is optimal. In the modified version,
we compute the value $\delta$ of the current solution w.r.t. the RMP objective function.
If $\delta = t$, again we stop. Otherwise, we possibly update the incumbent RMP solution
value $\delta^*$ (initially set to $\infty$). In any case, the algorithm stops when the target value $t$
(increased at each iteration) satisfies $\delta^* \le t$.

The branch-and-bound algorithm is effective for many instances, providing a provably
optimal solution within short time, especially in the relevant special case of $q = 3$,
for which the lower bound is reasonably close to the optimal value. In particular, the
processing of each subproblem within the branch-and-bound enumeration is very fast
(for our instances, a few hundred thousand subproblems per second on a PC). For the
remaining instances, the method is good at finding lower bounds on the optimal CMP
(and RMP) value but does not provide good heuristic solutions (even if various heuristics
are applied within the enumeration scheme). The following section describes a
heuristic based on a natural LP relaxation that performs well in practice.



4 An LP-Based Heuristic 

In [?] we proposed a natural ILP formulation of CMP with one binary variable $x_e$ for
each edge $e \in E$, equal to 1 if $e$ is in the CMP solution and 0 otherwise, and one binary
variable $y_C$ for each cycle $C$ that the CMP solution may define with $M_k$, $k \in Q$. For
details, we refer to [?] or to the full paper.

The ILP formulation contains an exponential (in $n$) number of variables and constraints.
Nevertheless, due to a fundamental result of Grötschel, Lovász and Schrijver
[?], the associated LP relaxation can be solved in polynomial time provided the separation
of the constraints and the generation of the $y$ variables can be done efficiently.
It is shown in [?] that this is indeed the case. However, in practice, this LP relaxation
turns out to be very difficult to solve with the present state-of-the-art LP solvers. In the
full paper, we provide experimental results showing that, even for $q = 3$, the largest
LPs solvable in a few hours correspond to instances with $n = 30$, which can be solved
within seconds by the combinatorial branch-and-bound algorithm of the previous section,
whereas LPs for $n \ge 40$ may take days or even weeks to be solved. Hence, an
exact algorithm based on the ILP formulation seems (at present) useless in practice. In
this section, we show how to use the LP relaxation within an effective heuristic.

The heuristic starts from the (in most cases, fractional) vectors $x^*$ produced at each
iteration within the iterative solution of the LP relaxation by column generation and
separation. For a given $x^*$, we apply a nearest neighbor algorithm, which starts from
some node $i_0 \in V$, selects the edge $(i_0, j) \in E$ such that $x^*_{i_0 j}$ is maximum, considers
the node $l$ such that $(j, l) \in H$, selects the edge $(l, p) \in E$ such that $p \ne i_0$
and $x^*_{lp}$ is maximum, considers the node $r$ such that $(p, r) \in H$, and so on, until
the edges selected form a permutation matching. We then apply to the final solution a
2-exchange procedure, where we check if the removal of two edges in the current solution
and their replacement with the two other edges that yield a permutation matching
yields an improvement in the CMP value. If this is the case, the replacement is performed
and the procedure is iterated. Otherwise, i.e., if no 2-exchange in the current
solution improves the CMP value, the procedure terminates. Each time a new solution
is considered, namely the initial solution produced by nearest neighbor and the solution
after each improvement, we compute the corresponding RMP value to possibly update
the best RMP solution so far.
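
The nearest neighbor construction can be sketched as follows. This is an illustrative reading of the description above rather than the paper's code: the function name, the dictionary representation of $x^*$, and the rule that the $H$-mate of the starting node is reserved for the final edge (so that the result is always a permutation matching) are assumptions of this example. The 2-exchange improvement step is not shown.

```python
def nearest_neighbor(start, x, h_partner):
    """Greedy construction of a permutation matching guided by LP values.

    x         -- dict mapping frozenset({u, v}) to the fractional value of edge (u, v);
                 edges missing from the dict count as 0
    h_partner -- dict mapping each node to its mate in the base matching H
    """
    closing = h_partner[start]          # node reserved for the final edge
    used = {start}
    current, edges = start, []
    while True:
        free = [v for v in h_partner if v not in used and v != closing]
        if not free:
            edges.append((current, closing))
            return edges
        # Pick the unused node whose edge from `current` has the largest LP value.
        nxt = max(free, key=lambda v: x.get(frozenset((current, v)), 0.0))
        edges.append((current, nxt))
        used.update((nxt, h_partner[nxt]))
        current = h_partner[nxt]        # cross the H edge and continue from there


H = [(1, 2), (3, 4), (5, 6), (0, 7)]
hp = {}
for u, v in H:
    hp[u], hp[v] = v, u
M1 = [(0, 3), (4, 6), (5, 1), (2, 7)]
x_star = {frozenset(e): 1.0 for e in M1}   # an integral LP vector equal to M(pi^1)
print(nearest_neighbor(0, x_star, hp))     # [(0, 3), (4, 6), (5, 1), (2, 7)]
```

When the LP vector is integral and corresponds to a permutation matching, the procedure recovers that matching from any starting node; for fractional vectors different starting nodes may give different solutions.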

It is easy to see that the complexity of nearest neighbor is $O(n^2)$, plus $O(qn)$ for the
evaluation of the value of the CMP (and RMP) solution found. It is also easy to verify
that checking if a 2-exchange yields an improvement in the CMP value can be done
in $O(q)$ time if one has a data structure that, for each $k \in Q$ and $i \in V$, tells which
is the cycle in $T \cup M_k$ that visits node $i$, where $T$ is the current solution (cycles are
simply identified by numbers). This data structure can be reconstructed in $O(qn)$ time
every time the current solution is improved. Hence, each iteration of the 2-exchange
procedure, which either yields an improvement or terminates the procedure, takes $O(qn^2)$
time since $O(n^2)$ 2-exchanges are considered.

For every vector $x^*$, we try each node $i \in V$ as starting node in nearest neighbor.
This is because starting from different nodes may yield solutions of considerably
different quality. Now the point is how to produce each $x^*$ within reasonable time,
as the solution of each LP within the iterative procedure is quite time consuming, as
mentioned above. A natural choice is to work with a “sparsified” graph, where only a
subset of the edges $E' \subseteq E$ is considered in the LP (the variables associated with the
other edges being fixed to 0). To this aim, the algorithm is organized in rounds.
In the first round, we let $E' := \bigcup_{k \in Q} M_k$. In the other rounds, we consider the other
edges by increasing value of lower bound $lb_c$. In particular, for each edge $e \in E$, we
compute $lb_c(e)$, the value of lower bound $lb_c$ if edge $e$ is fixed in the solution. In the
second round we let $E' := \bigcup_{k \in Q} M_k \cup \{e \in E : lb_c(e) \le lb_c\}$, in the third round
$E' := \bigcup_{k \in Q} M_k \cup \{e \in E : lb_c(e) \le lb_c + 1\}$, and so on. We stress that in the
nearest neighbor heuristic and in the 2-exchange procedure the whole set $E$ of edges is
considered (the definition of $E'$ is only meant to speed up the solution of the LPs).

To further drive the heuristic with the LP solutions, every time in a round the value
$\gamma$ of the current LP is such that $\lceil q\tilde{n} - \gamma \rceil < \delta^*$, where $\delta^*$ is the best RMP solution
value so far, we fix to 1 the $[n/10]$ $x$ variables whose value is highest, imposing the
associated edges in the solution. We make sure that the partial solution is contained
in a CMP solution, and perform the fixing only if at least 3 iterations (of column
generation or separation followed by an LP) were performed since the last fixing. The
round terminates when the optimal value of the LP (containing only the edges in $E'$ and
with some edges fixed in the solution), say $\gamma$, satisfies $\lceil q\tilde{n} - \gamma \rceil \ge \delta^*$.

The overall heuristic terminates either after a time limit or when the round corre- 
sponding to E' = E terminates. 



5 Experimental Results 

Our algorithms were implemented in C and run on a Digital Ultimate Workstation
533MHz. As already mentioned, the LP solver used was CPLEX 6.5. Moreover, we
computed the reversal distance between permutations using the implementation of the
linear time algorithm of [?], which is available at
http://www.cs.unm.edu/~moret/GRAPPA (in particular, we used the function
invdist_noncircular_nomem()).

The main test instance in the early development of our code is associated with three
real-world genomes each with $n = 33$ genes and reported in [?]. This instance already
gives some clear indication about the solution methods proposed in this paper, as our
branch-and-bound algorithm can solve it within less than 0.2 seconds (the optimal value
being 41), the LP heuristic finds an optimal solution in less than 0.05 seconds, whereas
the associated (complete) LP relaxation was not solved after 500,000 seconds!

In our experiments, we mainly solved instances with $q \in \{3, 4, 5\}$. The main reason
is that the RMP instances solved within the reconstruction of evolutionary trees are 
associated with small values of q (typically, q = 3). Moreover, as shown by the tables, 
the effectiveness of the lower bounds proposed in this paper quickly decreases as q 
increases; therefore not much can be said about the optimality or near-optimality of 
the solutions provided for instances with higher values of q (with some exceptions, see 
below).

All tables report the values of $q$ and $n$, the time limit imposed on the branch-and-bound
algorithm and on the LP heuristic for all instances (within the caption), and the
following information:

# inst.: the number of instances considered for each value of $q$ and $n$ (average and
maximum values refer to this number of instances);

# opt.: the number of instances solved to optimality by the branch-and-bound algorithm
within the time limit;

$\delta^*$: the average value of the best RMP solution found, i.e., the optimal one if the
branch-and-bound algorithm terminates before the time limit, otherwise the best
solution produced by the LP heuristic;

$lb_c$: the average value of lower bound $lb_c$;

$lb_{BB}$: the average value of the lower bound produced by the branch-and-bound
algorithm (equal to the optimum if the algorithm terminates before the time limit);

B&B subpr.: the average number of subproblems considered within the branch-and-bound
algorithm;

B&B time: the average (maximum) time required by the branch-and-bound algorithm
(possibly equal to the time limit);

$\delta^H$: the average value of the best RMP solution produced by the LP heuristic;

H time: the average (maximum) time required to find the best solution by the LP
heuristic;

gap: the average (maximum) difference between the best RMP solution found and
the lower bound produced by the branch-and-bound algorithm.



Table 1. Results on Uniformly Random Instances, Time Limit of 10 Minutes.

 q    n # inst. # opt. $\delta^*$  $lb_c$  $lb_{BB}$   B&B subpr.  B&B time        $\delta^H$  H time         gap
 3   10      10     10       14.2    13.5       14.2        328.6  0.0 (0.0)             14.2  0.0 (0.2)      0.0 (0)
 3   20      10     10       29.2    27.5       29.2      43312.0  0.1 (0.3)             29.2  0.2 (0.8)      0.0 (0)
 3   30      10     10       45.0    42.8       45.0    6504469.6  10.2 (45.6)           45.3  34.0 (232.4)   0.0 (0)
 3   40      10      2       61.7    57.2       60.3  317477043.2  524.9 (600.0)         61.8  58.0 (427.7)   1.4 (4)
 3   50      10      0       79.2    72.3       75.1  335299968.0  600.0 (600.0)         79.2  184.7 (584.9)  4.1 (5)
 4   10      10     10       21.1    17.9       21.1       6026.2  0.0 (0.0)             21.1  0.1 (0.4)      0.0 (0)
 4   20      10     10       44.1    37.2       44.1   70430739.2  171.2 (536.8)         44.1  2.8 (8.6)      0.0 (0)
 4   30      10      0       69.2    57.0       63.4  214200012.8  600.0 (600.0)         69.2  29.8 (183.1)   5.8 (7)
 4   40      10      0       93.7    76.3       82.2  192300006.4  600.0 (600.0)         93.7  80.2 (254.2)   11.5 (13)
 4   50      10      0      121.2    96.6      101.9  173600000.0  600.0 (600.0)        121.2  96.9 (450.2)   19.3 (20)
 5   10      10     10       28.7    22.4       28.7      69344.3  0.2 (0.3)             28.7  0.1 (0.2)      0.0 (0)
 5   20      10      0       59.3    46.3       56.6  175400012.8  600.0 (600.0)         59.3  63.4 (588.1)   2.7 (3)
 5   30      10      0       91.8    71.2       79.7  148699993.6  600.0 (600.0)         91.8  121.7 (423.8)  12.1 (14)
 5   40      10      0      125.7    95.3      103.3  129500006.4  600.0 (600.0)        125.7  141.8 (359.5)  22.4 (25)
 5   50      10      0      161.5   120.6      127.7  113400000.0  600.0 (600.0)        161.5  275.2 (507.2)  33.8 (35)



In Table 1 we present results on instances associated with uniformly random permutations.
The table shows that, for $q = 3$, the maximum size of such instances solvable to
proven optimality within reasonable time is between 30 and 40 (with a time limit of 1
hour instead of 10 minutes, 6 instances out of 10 can be solved for $n = 40$). For $q = 4$
the threshold is just above 20, whereas it is below 20 for $q = 5$. This reflects the bad
quality of the combinatorial lower bound for $q > 3$. The LP heuristic in some cases requires
some time to find the best solution, even if solutions whose value is 1 or at most
2 units larger are typically found in fractions of a second. Our feeling is that the solutions
found by the LP heuristic are near optimal even for the cases in which the lower
bound cannot certify this. Table 2 refers to the randomly generated instances mentioned



Table 2. Results on Random Instances from set1 in [?], Time Limit of 10 Minutes.

 r  q    n # inst. # opt. $\delta^*$  $lb_c$  $lb_{BB}$   B&B subpr.  B&B time        $\delta^H$  H time         gap
 2  3   20      10     10       14.0    13.5       14.0        364.7  0.0 (0.0)             14.2  0.0 (0.0)      0.0 (0)
 2  3   40      10     10       14.7    14.0       14.7        873.4  0.0 (0.0)             14.9  0.0 (0.0)      0.0 (0)
 2  3   80      10     10       15.1    14.2       15.1       5341.7  0.0 (0.0)             15.1  0.1 (0.2)      0.0 (0)
 2  3  160      10     10       15.1    14.2       15.1      21075.8  0.0 (0.1)             15.1  0.7 (0.7)      0.0 (0)
 2  3  320      10     10       15.0    14.2       15.0      56076.9  0.1 (0.2)             15.0  5.2 (5.2)      0.0 (0)
 4  3   10      10     10       14.5    13.6       14.5        492.9  0.0 (0.0)             14.8  0.0 (0.0)      0.0 (0)
 4  3   20      10     10       27.5    26.5       27.5      31017.6  0.0 (0.3)             27.9  0.0 (0.1)      0.0 (0)
 4  3   40      10      9       47.1    45.0       46.9   39167574.4  63.0 (600.0)          47.4  0.3 (2.7)      0.2 (2)
 4  3   80      10     10       56.5    55.1       56.5    4342064.8  7.0 (70.2)            56.8  0.4 (1.1)      0.0 (0)
 4  3  160      10     10       57.5    56.5       57.5      16382.0  0.0 (0.1)             57.6  1.0 (1.6)      0.0 (0)
 4  3  320      10     10       57.6    56.6       57.6      45550.9  0.1 (0.1)             57.6  6.0 (10.2)     0.0 (0)
 8  3   10      10     10       14.3    13.8       14.3        267.7  0.0 (0.0)             14.6  0.0 (0.0)      0.0 (0)
 8  3   20      10     10       29.5    27.9       29.5      42045.8  0.1 (0.2)             30.1  0.1 (0.2)      0.0 (0)
 8  3   40      10      4       61.6    57.3       60.3  258996275.2  433.1 (600.0)         62.1  0.9 (6.1)      1.3 (3)
 8  3   80      10      0      125.5   112.7      115.0  292400000.0  600.0 (600.0)        125.5  9.8 (18.3)     10.5 (13)
 8  3  160      10      0      190.6   177.5      179.7  272000025.6  600.0 (600.0)        190.6  46.2 (276.5)   10.9 (24)
 8  3  320      10      9      205.0   203.4      204.8   37536243.2  66.7 (600.0)         205.0  29.2 (73.1)    0.2 (2)
16  3   10      10     10       14.2    13.4       14.2        247.3  0.0 (0.0)             14.4  0.0 (0.0)      0.0 (0)
16  3   20      10     10       29.5    28.0       29.5      96419.2  0.1 (0.7)             29.9  0.0 (0.1)      0.0 (0)
16  3   40      10      3       62.0    57.8       60.6  299220787.2  501.5 (600.0)         62.5  0.4 (0.6)      1.4 (3)
16  3   80      10      0      130.8   116.0      118.3  288000000.0  600.0 (600.0)        130.8  10.8 (23.8)    12.5 (14)
16  3  160      10      0      276.6   236.2      237.3  204600000.0  600.0 (600.0)        276.6  248.8 (508.1)  39.3 (42)
16  3  320      10      0      537.1   451.4      451.8  146000000.0  600.0 (600.0)        537.1  244.5 (600.0)  85.3 (98)



in [?] and publicly available at http://www.cs.unm.edu/~moret/GRAPPA.
We report only results for the instances in set1; the results for those in set2 are analogous
and can be found in the full paper. All these instances have $q = 3$ permutations, each
generated starting from the identity permutation and simulating a number of evolutionary
changes. Parameter $r$ increases with the number of changes simulated, therefore for
larger values of $r$ the RMP solution value is larger. The two tables show that, at least
for $q = 3$, RMP is typically easy to solve if the solution value (i.e., the overall reversal
distance) is considerably smaller than in the case of uniformly random permutations, for
which the average solution value seems to be roughly $1.5n$ (see Table 1). Specifically,
instances with $\delta^* < n$ can be solved quickly, even for $n = 320$. We stress that these are
the instances for which the reversal distance gives a more reliable indication about the
actual evolutionary distance, see also [?]. Note that in both tables, for $r = 8$, instances
with $n = 320$ (for which $\delta^* < n$) are easier than instances with $n = 80$ and $n = 160$
(for which $\delta^* > n$).

We conclude by presenting results for instances associated with real-world genomes.
We considered two sets of genomes, one corresponding to chloroplast genomes mentioned
in [?] and available at http://www.cs.unm.edu/~moret/GRAPPA,
each with $n = 105$ genes, and the other to mitochondrial genomes that we obtained
from Mathieu Blanchette [?], each with $n = 37$ genes. For each genome set and
$q = 3, 4, 5$, we considered, respectively, the instances associated with all triples, four-tuples
and five-tuples in the set. The results are reported in Tables 3 and 4. As the number
of instances is quite large, we imposed one minute time limit both on the branch-and- 



On the Practical Solution of the Reversal Median Problem 






Table 3. Results on Instances from 13 Chloroplast Genomes, Time Limit of 1 Minute.

 q    n # inst. # opt. $\delta^*$  $lb_c$  $lb_{BB}$   B&B subpr.  B&B time        $\delta^H$  H time         gap
 3  105     286    286       19.4    18.7       19.4      48697.0  0.1 (1.1)             19.4  - (-)          0.0 (0)
 4  105     715    651       28.5    25.0       28.3    6257725.6  12.1 (60.0)           28.5  0.4 (5.1)      0.1 (4)
 5  105    1287   1073       36.5    31.2       36.3    9265956.4  24.1 (60.0)           36.5  0.8 (7.3)      0.2 (4)

1 Trachelium   2 Campanula   3 Adenophora     4 Symphyandra   5 Legousia
6 Asyneuma     7 Triodanus   8 Wahlenbergia   9 Merciera     10 Codonopsis
11 Cyananthus  12 Platycodon  13 Tobacco

Table 4. Results on Instances from 15 Mitochondrial Genomes, Time Limit of 1 Minute.

 q    n # inst. # opt. $\delta^*$  $lb_c$  $lb_{BB}$   B&B subpr.  B&B time        $\delta^H$  H time         gap
 3   37     455    394       45.0    42.8       44.7    8338102.3  13.3 (60.0)           45.0  0.4 (1.7)      0.2 (5)
 4   37    1365     59       69.2    57.1       62.9   21866065.4  58.0 (60.0)           69.2  0.8 (10.8)     6.3 (16)
 5   37    3003     19       90.8    71.3       79.1   15178139.2  59.8 (60.0)           90.8  2.6 (43.8)     11.6 (23)

 1 Homo Sapiens             2 Albinaria Coerulea       3 Arbacia Lixula
 4 Artemia Franciscana      5 Ascaris Suum             6 Asterina Pectinifera
 7 Balanoglossus Carnosus   8 Cepaea Nemoralis         9 Cyprinus Carpio
10 Drosophila Yakuba       11 Florometra Serratissima 12 Katharina Tunicata
13 Lumbricus Terrestris    14 Onchocerca Volvulus     15 Struthio Camelus



Table 5. Results on Instances from 13 Chloroplast Genomes for Large Values of $q$.
Time Limit of 1 Hour.

 q    n # inst. # opt. $\delta^*$  $lb_c$  $lb_{BB}$   B&B subpr.  B&B time        $\delta^H$  H time         gap
 6  105      13     13       42.1    34.8       42.1  135100475.1  465.7 (3149.2)        42.1  - (-)          0.0 (0)
 7  105      13     12       51.0    41.5       50.9  150696379.1  673.4 (3600.0)        51.0  4.2 (4.2)      0.1 (1)
 8  105      13      8       60.6    47.9       59.8  272373385.8  1548.1 (3600.0)       60.6  0.8 (0.9)      0.8 (4)
 9  105      13     10       69.0    54.6       68.5  263701838.8  1828.0 (3600.0)       69.0  1.3 (1.9)      0.5 (4)
10  105      13      7       78.3    61.2       76.9  234546156.3  1943.5 (3600.0)       78.3  2.1 (6.2)      1.4 (5)
11  105      13      8       86.7    68.2       85.6  204401368.6  1989.2 (3600.0)       86.7  1.6 (2.7)      1.1 (4)
12  105      13      8       95.1    74.5       94.2  185584896.0  2121.0 (3600.0)       95.1  1.4 (1.5)      0.8 (3)
13  105       1      1      103.0    81.0      103.0  179367376.0  2361.0 (2361.0)      103.0  - (-)          0.0 (0)



bound and the heuristic. Moreover, we applied the LP heuristic only to the instances 
that were not solved to optimality by the branch-and-bound. 

The behavior for the mitochondrial instances is analogous to the case of random
permutations, namely almost all instances can be solved to proven optimality for $q = 3$
whereas for $q = 4$ and 5 the gap is 0 only for very few instances, being (on average)
equal to 9% and 13%, respectively.

On the other hand, chloroplast instances turn out to be relatively easy to solve, the
solution value being much smaller than for random permutations. With very few exceptions,
all instances are solved within one unit of the optimum in very short time. For this
reason, we tried also to solve instances for higher values of $q$. For $q \in \{6, \ldots, 12\}$ we
solved the 13 instances obtained by taking $q$ consecutive (in a circular sense) genomes
starting from the first, the second, ..., the thirteenth. For $q = 13$ we solved the instance
with the whole genome set. Table 5 gives the corresponding results, with a time limit
of one hour, motivated by the fact that the number of instances considered is small and
that, for $n = 105$, 1 minute or one hour often makes a big difference, which is not the
case for “small” $n$ (say, $n \le 50$), probably because of the exponential nature of the algorithm.
The table shows that, for these genomes, even an instance with $q = 13$ can be
solved optimally within reasonable time, and that the average gap over all the instances
is about 1 unit.

Abstract. Comparing gene orders in completely sequenced genomes is a standard
approach to locate clusters of functionally associated genes. Often, gene orders
are modeled as permutations. Given $k$ permutations of $n$ elements, a $k$-tuple
of intervals of these permutations consisting of the same set of elements is called
a common interval. We consider several problems related to common intervals
in multiple genomes. We present an algorithm that finds all common intervals in
a family of genomes, each of which might consist of several chromosomes. We
present another algorithm that finds all common intervals in a family of circular
permutations. A third algorithm finds all common intervals in signed permutations.
We also investigate how to combine these approaches. All algorithms have
optimal worst-case time complexity and use linear space.



1 Introduction 

The conservation of gene order has been extensively studied so far. There is strong
evidence that genes clustering together in phylogenetically distant genomes frequently
encode functionally associated proteins or indicate recent horizontal gene transfer.
Due to the increasing amount of completely sequenced genomes, the comparison of gene
orders to find conserved gene clusters is becoming a standard approach for protein
function prediction.

In this paper we describe efficient algorithms for finding gene clusters for various 
types of genomic data. We represent gene orders by permutations (re-orderings) of inte- 
gers. Hence gene clusters correspond to intervals (contiguous subsets) in permutations, 
and the problem of finding conserved gene clusters in different genomes translates to 
the problem of finding common intervals in multiple permutations. 

In addition to this bioinformatic application, common intervals also relate to the
consecutive arrangement problem and to cross-over operators for genetic algorithms
solving sequencing problems such as the traveling salesman problem or the single
machine scheduling problem.

Recently, Uno and Yagiura presented an optimal $O(n + K)$ time and $O(n)$ space
algorithm for finding all $K \le \binom{n}{2}$ common intervals of two permutations
$\pi_1$ and $\pi_2$ of $n$ elements. We generalized this algorithm to a family
$\Pi = (\pi_1, \ldots, \pi_k)$ of $k \ge 2$ permutations in optimal $O(kn + K)$ time and
$O(n)$ space by restricting the set of common intervals to a smaller, generating subset.
To apply common intervals to the bioinformatic problem of finding conserved clusters of
genes in data derived from completely sequenced genomes we further extended the above
algorithm to additional types of permutations.

Genomes of higher organisms generally consist of several linear chromosomes while
bacterial, archaeal, and mitochondrial DNA is organized in one to several circular
pieces. While in the first case the above algorithm might report too many gene clusters
if the multiple chromosomes are simply concatenated, in the latter case gene clusters
might be missed if the circular pieces are cut at some arbitrary point. We handle this
problem by adapting the original algorithm to multichromosomal permutations as well as
circular permutations.

For prokaryotes, it is also known that, in the vast majority of cases, functionally 
associated genes of a gene cluster lie on the same DNA strand. We take this 
into account by constructing signed permutations where the sign of a gene indicates the 
strand it lies on. We then determine all common intervals with the additional restriction 
that within each permutation, the elements of a common interval must have the same 
sign, while between permutations the sign might vary. This allows us to restrict the set 
of common intervals to biologically meaningful candidates. 

The paper is organized as follows. In Section 2 we formally define common intervals
and related terminology. We briefly describe the algorithms of Uno and Yagiura and of
Heber and Stoye to find all common intervals of 2 (respectively $k \ge 2$)
permutations. Then we present time- and space-optimal algorithms for the problem of
finding all common intervals in multichromosomal permutations (Section 3), in signed
permutations (Section 4), and in circular permutations (Section 5). In Section 6 we
show how the various approaches can be combined without sacrificing the optimal time
complexity. Section 7 concludes with a few final remarks.

2 Common and Irreducible Intervals 

2.1 Basic Definitions 

A permutation $\pi$ of (the elements of) the set $N := \{1, 2, \ldots, n\}$ is a
re-ordering of the elements of $N$. We denote by $\pi(i) = j$ that the $i$th element in
this re-ordering is $j$. For $1 \le x \le y \le n$, we set
$[x, y] := \{x, x+1, \ldots, y\}$ and call
$\pi([x, y]) := \{\pi(i) \mid i \in [x, y]\}$ an interval of $\pi$.

Let $\Pi = (\pi_1, \ldots, \pi_k)$ be a family of $k$ permutations of $N$. Without loss
of generality we assume in this section that $\pi_1 = id_n := (1, \ldots, n)$. A subset
$c \subseteq N$ is called a common interval of $\Pi$ if and only if there exist
$1 \le l_j < u_j \le n$ for all $1 \le j \le k$ such that

$c = \pi_1([l_1, u_1]) = \pi_2([l_2, u_2]) = \ldots = \pi_k([l_k, u_k])$.

Note that this definition excludes common intervals of size one.

In the following we represent a common interval $c$ either by specifying its elements
or by the shorter notation $\pi_j([l_j, u_j])$ for a $j \in \{1, \ldots, k\}$. (For
$\pi_1 = id_n$ this notation further simplifies to $[l_1, u_1]$.) The set of all common
intervals of $\Pi = (\pi_1, \ldots, \pi_k)$ is denoted $C_\Pi$.

Example 1. Let $N = \{1, \ldots, 9\}$ and $\Pi = (\pi_1, \pi_2, \pi_3)$ with
$\pi_1 = id_9$, $\pi_2 = (3, 2, 1, 9, 7, 8, 6, 5, 4)$, and
$\pi_3 = (4, 5, 6, 8, 7, 1, 2, 3, 9)$. With respect to $\pi_1$ we have

$C_\Pi = \{[1,2], [1,3], [1,9], [2,3], [4,5], [4,6], [4,8], [5,6], [5,8], [6,8], [7,8]\}$.

□

In order to keep this paper self-contained, in the remainder of this section we recall
the algorithms of Uno and Yagiura and of Heber and Stoye that find all common intervals
of 2 (respectively $k \ge 2$) permutations. We will restrict our description to basic
ideas and only give details where they are necessary for an understanding of the new
algorithms described in Sections 3-6 of this paper.

2.2 Finding All Common Intervals of Two Permutations 

Here we consider the problem of finding all common intervals of $k = 2$ permutations
$\pi_1 = id_n$ and $\pi_2$ of $N$.

An easy test if an interval $\pi_2([x, y])$, $1 \le x < y \le n$, is a common interval
of $\Pi = (\pi_1, \pi_2)$ is based on the following functions:

$l(x, y) := \min \pi_2([x, y])$
$u(x, y) := \max \pi_2([x, y])$
$f(x, y) := u(x, y) - l(x, y) - (y - x)$.

Since $f(x, y)$ counts the number of elements in
$[l(x, y), u(x, y)] \setminus \pi_2([x, y])$, an interval $\pi_2([x, y])$ is a common
interval of $\Pi$ if and only if $f(x, y) = 0$. A simple algorithm to find $C_\Pi$ is to
test for each pair of indices $(x, y)$ with $1 \le x < y \le n$ if $f(x, y) = 0$,
yielding a naive $O(n^3)$ time or, using running minima and maxima, a slightly more
involved $O(n^2)$ time algorithm.
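
The quadratic variant can be sketched in a few lines. The following Python code is only an illustration under stated assumptions, not the authors' implementation: the function names `common_intervals_two` and `common_intervals_family` are hypothetical, permutations are plain lists, and the family version simply intersects the pairwise results, which is valid only because $\pi_1 = id_n$ is assumed.

```python
def common_intervals_two(pi2):
    """Return all common intervals (l, u) of (id_n, pi2), given w.r.t. id_n.

    pi2 is a list holding a permutation of 1..n. For each left end x the
    right end y is extended one step at a time while the running minimum
    and maximum are maintained, so each test f(x, y) = 0 costs O(1) and
    the whole procedure runs in O(n^2) time.
    """
    n = len(pi2)
    result = set()
    for x in range(n):                # 0-based left end
        lo = hi = pi2[x]
        for y in range(x + 1, n):     # 0-based right end, y > x
            lo = min(lo, pi2[y])      # l(x, y)
            hi = max(hi, pi2[y])      # u(x, y)
            if hi - lo - (y - x) == 0:  # f(x, y) = 0
                result.add((lo, hi))
    return result


def common_intervals_family(perms):
    """Common intervals of (id_n, perms[0], perms[1], ...) by intersection."""
    return set.intersection(*(common_intervals_two(p) for p in perms))


# The permutations pi_2 and pi_3 of Example 1.
pi2 = [3, 2, 1, 9, 7, 8, 6, 5, 4]
pi3 = [4, 5, 6, 8, 7, 1, 2, 3, 9]
print(sorted(common_intervals_family([pi2, pi3])))
```

Applied to Example 1, the sketch reports the eleven intervals listed there.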

In order to save the time to test $f(x, y) = 0$ for some pairs $(x, y)$, Uno and
Yagiura introduce the notion of wasteful candidates for $y$.

Definition 1. For a fixed $x$, a right interval end $y > x$ is called wasteful if it
satisfies $f(x', y) > 0$ for all $x' \le x$.

Based on this notion, Uno and Yagiura give an algorithm called RC (short for Reduce
Candidate) that has as its essential part a data structure $Y$ consisting of a
doubly-linked list ylist for the indices of non-wasteful right interval end candidates
and, storing intervals of ylist, two further doubly-linked lists llist and ulist that
implement the functions $l$ and $u$ in order to compute $f$ efficiently. An outline of
Algorithm RC is shown in Algorithm 1, where $L.succ(e)$ denotes the successor of
element $e$ in a doubly linked list $L$. After initializing the lists of $Y$, a counter
$x$ (corresponding to the currently investigated left interval end) runs from $n - 1$
down to 1. In each iteration step, during the update of $Y$, ylist is trimmed such that
afterwards the function $f(x, y)$ is monotonically increasing for the elements $y$
remaining in ylist. In lines 5-7, this allows us to efficiently find all common
intervals with left end $x$ by evaluating $f(x, y)$ running left-to-right through the
elements $y > x$ of ylist until an index $y$ is encountered with $f(x, y) > 0$, when the
reporting procedure stops.

Algorithm 1 (Reduce Candidate, RC)

Input: A family $\Pi = (\pi_1 = id_n, \pi_2)$ of two permutations of $N = \{1, \ldots, n\}$.
Output: The set of all common intervals $C_\Pi$.
1: initialize $Y$
2: for $x = n - 1, \ldots, 1$ do
3:     update $Y$ // trim ylist, update llist and ulist
4:     $y \leftarrow x$
5:     while ($y \leftarrow ylist.succ(y)$) defined and $f(x, y) = 0$ do
6:         output $[l(x, y), u(x, y)]$
7:     end while
8: end for



For details of the data structure $Y$ and the update procedure in line 3, see the
original description of Algorithm RC. The analysis shows that the update of data
structure $Y$ in line 3 can be performed in amortized $O(1)$ time, such that the
complete algorithm takes $O(n + K)$ time to find the $K$ common intervals of $\pi_1$
and $\pi_2$.

2.3 Irreducible Intervals 

Before we show how to generalize Algorithm RC to find all common intervals of
$k \ge 2$ permutations, we first present a useful generating subset of the set of
common intervals, the set of irreducible intervals, and report a few of their
properties.

We say that two common intervals $c_1, c_2 \in C_\Pi$ have a non-trivial overlap if
$c_1 \cap c_2 \ne \emptyset$ and neither includes the other. A list
$p = (c_1, \ldots, c_{\ell(p)})$ of common intervals
$c_1, \ldots, c_{\ell(p)} \in C_\Pi$ is a chain (of length $\ell(p)$) if every two
successive intervals in $p$ have a non-trivial overlap. A chain of length one is called
a trivial chain, all other chains are called non-trivial chains. A chain that can not be
extended to its left or right is a maximal chain. It is easy to see that for a chain $p$
of common intervals, the interval $\tau(p) := \bigcup_{c' \in p} c'$ is a common
interval as well. We say that $p$ generates $\tau(p)$.

Definition 2. A common interval $c$ is called reducible if there is a non-trivial chain
that generates $c$, otherwise it is called irreducible.

This definition partitions the set of common intervals $C_\Pi$ into the set of reducible
intervals and the set of irreducible intervals, denoted $I_\Pi$. Obviously,
$1 \le |I_\Pi| \le |C_\Pi| \le \binom{n}{2}$.

Example 1 (cont'd). For $\Pi = (\pi_1, \pi_2, \pi_3)$ as above, the irreducible
intervals (with respect to $\pi_1 = id_9$) are

$I_\Pi = \{[1,2], [1,9], [2,3], [4,5], [5,6], [6,8], [7,8]\}$.

The reducible intervals are generated as follows:

$[1,3] = [1,2] \cup [2,3]$,
$[4,6] = [4,5] \cup [5,6]$,
$[4,8] = [4,5] \cup [5,6] \cup [6,8]$,
$[5,8] = [5,6] \cup [6,8]$.
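
The split of Example 1 into irreducible and reducible intervals can be reproduced with a small brute-force check. The sketch below is an illustration only, not the method of the paper. The function name `split_irreducible` is hypothetical, intervals are given with respect to $\pi_1 = id_n$ as pairs `(l, u)`, and the test relies on the assumption that an interval generated by some non-trivial chain is also the union of two overlapping common intervals, one sharing its left end and one sharing its right end.

```python
def split_irreducible(common):
    """Partition common intervals (l, u) into (irreducible, reducible).

    An interval (a, b) is classified as reducible if there are common
    intervals (a, y) and (x, b) with a < x <= y < b, i.e. a chain of
    length two with a non-trivial overlap whose union is (a, b).
    """
    common = set(common)
    irreducible, reducible = set(), set()
    for (a, b) in common:
        # right ends of shorter common intervals starting at a
        right_ends = [y for (l, y) in common if l == a and y < b]
        # left ends of shorter common intervals ending at b
        left_ends = [x for (x, u) in common if u == b and x > a]
        if any(x <= y for y in right_ends for x in left_ends):
            reducible.add((a, b))
        else:
            irreducible.add((a, b))
    return irreducible, reducible


# The eleven common intervals of Example 1.
c_pi = [(1, 2), (1, 3), (1, 9), (2, 3), (4, 5), (4, 6),
        (4, 8), (5, 6), (5, 8), (6, 8), (7, 8)]
irr, red = split_irreducible(c_pi)
print(sorted(irr))
print(sorted(red))
```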
We cite the following two results (without proofs), which indicate the great value of
the concept of irreducible intervals.

Lemma 1. Given a family $\Pi = (\pi_1, \ldots, \pi_k)$ of permutations of
$N = \{1, 2, \ldots, n\}$, the set of irreducible intervals $I_\Pi$ allows us to
reconstruct the set of all common intervals $C_\Pi$ in optimal $O(|C_\Pi|)$ time. □

Lemma 2. Given a family $\Pi = (\pi_1, \ldots, \pi_k)$ of permutations of
$N = \{1, 2, \ldots, n\}$, we have $1 \le |I_\Pi| \le n - 1$. □
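
The construction behind Lemma 1 is not given here. As a rough illustration of what "generating" means, the following sketch rebuilds the common intervals from the irreducible ones by repeatedly uniting intervals that have a non-trivial overlap. This naive fixed-point iteration (the name `generate_common` is hypothetical, and intervals are pairs `(l, u)` with respect to $\pi_1 = id_n$) does not achieve the $O(|C_\Pi|)$ bound of the lemma.

```python
def generate_common(irreducible):
    """Close a set of intervals (l, u) under unions of overlapping pairs.

    Two intervals (a, b) and (c, d) overlap non-trivially, with (a, b)
    to the left, exactly if a < c <= b < d; their union is then (a, d).
    """
    result = set(irreducible)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(result):
            for (c, d) in list(result):
                if a < c <= b < d and (a, d) not in result:
                    result.add((a, d))
                    changed = True
    return result


# The irreducible intervals of Example 1 (n = 9).
i_pi = {(1, 2), (1, 9), (2, 3), (4, 5), (5, 6), (6, 8), (7, 8)}
print(sorted(generate_common(i_pi)))
```

For Example 1 the seven irreducible intervals (at most $n - 1 = 8$, in line with Lemma 2) regenerate all eleven common intervals.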

2.4 Finding All Common Intervals of k Permutations 

Now we can describe the algorithm of Heber and Stoye that finds all $K$ common
intervals of a family of $k \ge 2$ permutations of $N$ in $O(kn + K)$ time.

For $1 \le i \le k$, set $\Pi_i := (\pi_1, \ldots, \pi_i)$. Starting with
$I_{\Pi_1} = \{[j, j+1] \mid 1 \le j \le n - 1\}$, the algorithm successively computes
$I_{\Pi_i}$ from $I_{\Pi_{i-1}}$ for $i = 2, \ldots, k$ (see Algorithm 2). The
algorithm employs a mapping

$\varphi_i : I_{\Pi_{i-1}} \to I_{\Pi_i}$

that maps each element $c \in I_{\Pi_{i-1}}$ to the smallest common interval
$c' \in C_{\Pi_i}$ that contains $c$. It is shown in the original description that this
mapping exists and is surjective, i.e.,
$\varphi_i(I_{\Pi_{i-1}}) := \{\varphi_i(c) \mid c \in I_{\Pi_{i-1}}\} = I_{\Pi_i}$.
Furthermore, it is shown that $\varphi_i(I_{\Pi_{i-1}})$ can be efficiently computed by
a modified version of Algorithm RC where the data structure $Y$ is supplemented by a
data structure $S$ that is derived from $I_{\Pi_{i-1}}$. $S$ consists of several
doubly-linked clists containing intervals of ylist, one for each maximal chain of the
intervals in $I_{\Pi_{i-1}}$.

Algorithm 2 (Finding All Common Intervals of $k$ Permutations)

Input: A family $\Pi = (\pi_1 = id_n, \pi_2, \ldots, \pi_k)$ of $k$ permutations of $N = \{1, \ldots, n\}$.
Output: The set of all common intervals $C_\Pi$.
1: $I_{\Pi_1} \leftarrow ([1,2], [2,3], \ldots, [n-1,n])$
2: for $i = 2, \ldots, k$ do
3:     $I_{\Pi_i} \leftarrow \{\varphi_i(c) \mid c \in I_{\Pi_{i-1}}\}$ // (see Algorithm 3)
4: end for
5: generate $C_\Pi$ from $I_\Pi = I_{\Pi_k}$ using Lemma 1
6: output $C_\Pi$

Using $\pi_1$ and $\pi_i$, as in Algorithm RC, the ylist of $Y$ allows for a given $x$ to
access all non-wasteful right interval end candidates $y$ of $C_{(\pi_1, \pi_i)}$. The
aim of $S$ is to further reduce these candidates to only those indices $y$ for which
simultaneously $[x, y] \in C_{\Pi_{i-1}}$ (ensuring $[x, y] \in C_{\Pi_i}$) and $[x, y]$
contains an interval $c \in I_{\Pi_{i-1}}$ that is not contained in any smaller interval
from $C_{\Pi_i}$. Together this ensures that exactly the irreducible intervals
$[x, y] \in \varphi_i(I_{\Pi_{i-1}})$ are reported.
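
The effect of the mapping $\varphi_i$ can be illustrated without the data structures $Y$ and $S$. The following brute-force sketch follows the outer loop of the algorithm but computes each $\varphi_i(c)$ by a plain search over all common intervals. The names `intervals_of`, `phi`, and `irreducible_intervals` are hypothetical, $\pi_1 = id_n$ is assumed, and the running time is polynomial but far from the $O(kn + K)$ bound.

```python
def intervals_of(pi):
    """Value ranges (l, u), l < u, that occupy consecutive positions in pi."""
    found = set()
    for x in range(len(pi)):
        lo = hi = pi[x]
        for y in range(x + 1, len(pi)):
            lo, hi = min(lo, pi[y]), max(hi, pi[y])
            if hi - lo == y - x:
                found.add((lo, hi))
    return found


def phi(c, common):
    """Smallest interval in `common` that contains c = (l, u)."""
    l, u = c
    candidates = [(a, b) for (a, b) in common if a <= l and u <= b]
    return min(candidates, key=lambda ab: ab[1] - ab[0])


def irreducible_intervals(perms):
    """Irreducible intervals of (id_n, *perms); perms = [pi_2, ..., pi_k]."""
    n = len(perms[0])
    irreducible = {(j, j + 1) for j in range(1, n)}            # I for Pi_1
    common = {(l, u) for l in range(1, n) for u in range(l + 1, n + 1)}
    for pi in perms:
        common &= intervals_of(pi)                             # C for Pi_i
        irreducible = {phi(c, common) for c in irreducible}    # I for Pi_i
    return irreducible


pi2 = [3, 2, 1, 9, 7, 8, 6, 5, 4]
pi3 = [4, 5, 6, 8, 7, 1, 2, 3, 9]
print(sorted(irreducible_intervals([pi2, pi3])))
```

On Example 1 the first step maps, for instance, $[3,4]$ to $[1,9]$ and $[8,9]$ to $[7,9]$, and the second step maps $[7,9]$ to $[1,9]$, leaving the seven irreducible intervals of the example.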
Algorithm 3 (Extended Algorithm RC)

Input: A family $\Pi = (\pi_1 = id_n, \pi_i)$ of two permutations of $N = \{1, \ldots, n\}$; a set of irreducible intervals $I_{\Pi_{i-1}}$.
Output: The set of irreducible intervals $I_{\Pi_i}$.
1: initialize $Y$ and $S$
2: for $x = n - 1, \ldots, 1$ do
3:     update $Y$ and $S$ // trim ylist, update l/ulist; activate elements of the clists
4:     while ($[x', y] \leftarrow S.first\_active\_interval(x)$) defined and $f(x, y) = 0$ do
5:         output $[l(x, y), u(x, y)]$
6:         deactivate $[x', y]$
7:     end while
8: end for



An outline of the modified version of Algorithm RC is shown in Algorithm 3.
Essentially, $S$ keeps a list of active intervals, i.e., intervals from $I_{\Pi_{i-1}}$
for which the image under mapping $\varphi_i$ has not yet been determined. In the
reporting loop of lines 4-7, rather than testing if $f(x, y) = 0$ running from left to
right through all indices $y > x$ of ylist, only right ends of active intervals are
tested. Therefore, function $S.first\_active\_interval(x)$ returns the active intervals
in left-to-right order with respect to their right end $y$. If an active interval
$[x', y]$ gives rise to a common interval, i.e., if $f(x, y) = 0$, then an element of
$\varphi_i(I_{\Pi_{i-1}})$ is encountered and the active interval is deactivated.
Similar to Algorithm RC, reporting stops whenever the first active interval with right
end $y$ is encountered such that $f(x, y) > 0$.

Again, for details of the data structure and the update procedure in line 3 we refer
to the original description. There it is also shown that updating the data structure
$S$ takes amortized $O(1)$ time. Hence, due to the reduced output size (see Lemma 2),
the Extended Algorithm RC takes only $O(n)$ time. Together with Lemma 1 this implies
the overall time complexity $O(kn + K)$ for Algorithm 2. The additional space usage is
$O(n)$.

3 Common Intervals of Multichromosomal Permutations

In view of biological reality, in the following we consider variants of the common
intervals problem that have to be addressed when dealing with real genomic data. The
first variant that we consider is the scenario where the genome consists of multiple
chromosomes.

As above, let $N := \{1, 2, \ldots, n\}$ represent a set of $n$ genes. A chromosome $c$
of $N$ is defined as a linearly ordered subset of $N$ and will be represented as a
linear list. A multichromosomal permutation $\pi$ of $N$ is defined as a set of
chromosomes, containing each element of $N$ exactly once, i.e.,

$N = \dot{\bigcup}_{c \in \pi} c$.

Given a family $\Pi = (\pi_1, \ldots, \pi_k)$ of $k$ multichromosomal permutations of
$N$, a subset $s \subseteq N$ is called a common interval of $\Pi$ if and only if for
each multichromosomal permutation $\pi_i$, $i = 1, \ldots, k$, there exists a chromosome
with $s$ as an interval.



Example 2. Let $N = \{1, \ldots, 6\}$ and $\Pi = (\pi_1, \pi_2, \pi_3)$ with
$\pi_1 = \{(1, 2, 3), (4, 5, 6)\}$, $\pi_2 = \{(1, 5, 6, 4), (3, 2)\}$, and
$\pi_3 = \{(1, 6, 4, 5), (3), (2)\}$. Here chromosome ends are indicated by
parentheses. The only common interval is $\{4, 5, 6\}$. □
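
The definition can be checked directly on Example 2 by enumerating, for every genome, all gene sets that are contiguous on one of its chromosomes and intersecting the results. This is a brute-force sketch for illustration, not the algorithm developed in this section. The function names are hypothetical and a genome is assumed to be a list of chromosomes, each a list of genes.

```python
def chromosome_intervals(genome):
    """All gene sets of size >= 2 that are contiguous on some chromosome."""
    found = set()
    for chrom in genome:
        for x in range(len(chrom)):
            for y in range(x + 2, len(chrom) + 1):
                found.add(frozenset(chrom[x:y]))
    return found


def common_intervals_multichromosomal(genomes):
    """Gene sets that form an interval of one chromosome in every genome."""
    return set.intersection(*(chromosome_intervals(g) for g in genomes))


# The three genomes of Example 2.
pi1 = [[1, 2, 3], [4, 5, 6]]
pi2 = [[1, 5, 6, 4], [3, 2]]
pi3 = [[1, 6, 4, 5], [3], [2]]
print(common_intervals_multichromosomal([pi1, pi2, pi3]))
```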

A modification of Algorithm 2 can be used for finding all common intervals of $k$
multichromosomal permutations. We start by arranging the chromosomes of each
multichromosomal permutation in arbitrary order. This way we obtain a family
$\Pi' = (\pi'_1, \pi'_2, \ldots, \pi'_k)$ of $k$ (standard) permutations $\pi'_i$,
$i = 1, \ldots, k$. Without loss of generality we assume that $\pi'_1 = id_n$. Now, as
above, set $\Pi'_i := (\pi'_1, \pi'_2, \ldots, \pi'_i)$. Then, starting with

$I_{\Pi'_1} := \{[j, j+1] \mid j, j+1 \text{ on the same chromosome in } \pi_1 \text{ and } 1 \le j < n\}$,

the algorithm successively computes $I_{\Pi'_i}$ from $I_{\Pi'_{i-1}}$ for
$i = 2, \ldots, k$ using a modification of Algorithm 2 where in the extended Algorithm
RC (Algorithm 3) the reporting procedure is not only stopped whenever $f(x, y) > 0$,
but also as soon as the genes at indices $x$ and $y$ belong to different chromosomes of
$\pi_i$.

By the definition of $I_{\Pi'_1}$, this algorithm will never place two genes from
different chromosomes in $\pi_1$ together in a common interval. Moreover, by the
modification of Algorithm 3, two genes from different chromosomes of the other genomes
$\pi_2, \ldots, \pi_k$ will never be placed together in a common interval. Nevertheless,
the location of common intervals that lie on the same chromosome in all genomes is not
affected by the modification of the algorithm. Since the additional test if $x$ and $y$
belong to the same chromosome is a constant time operation, and the output can only be
smaller than that of the original Algorithm 3, the new algorithm also takes $O(n)$ time
to generate $I_{\Pi'_i}$ from $I_{\Pi'_{i-1}}$. The outer loop (Algorithm 2) and the
final generation of the common intervals from the irreducible intervals (Lemma 1) are
unchanged, so that we have the following

Theorem 1. Given $k$ multichromosomal permutations of $N = \{1, \ldots, n\}$, all $K$
common intervals can be found in optimal $O(kn + K)$ time using $O(n)$ additional
space. □



4 Common Intervals of Signed Permutations 

In this section we consider the problem of finding all common intervals in a family of
signed permutations. It is common practice when considering genome rearrangement
problems to denote the direction of a gene in the genome by a plus (+) or minus (-)
sign depending on the DNA strand it is located on. In the context of sorting signed
permutations by reversals, the sign of a gene tells the direction of the gene in the
final (sorted) permutation and changes with each reversal. In our context, it has been
observed that for prokaryotes, functionally coupled genes, e.g. in operons, virtually
always lie on the same DNA strand. Hence, when given signed permutations, we require
that the sign does not change within an interval. Between the different permutations,
the sign of the intervals might vary, though. This restricts the (original) set of all
common intervals to the biologically more meaningful candidates.

Example 3. Let $N = \{1, \ldots, 6\}$ and $\Pi = (\pi_1, \pi_2, \pi_3)$ with
$\pi_1 = (+1, +2, +3, +4, +5, +6)$, $\pi_2 = (-3, -1, -2, +5, +4, +6)$, and
$\pi_3 = (-4, +5, +6, -2, -3, -1)$. With respect to $\pi_1$ the interval $[1, 3]$ is a
common interval, but $[4, 5]$ and $[4, 6]$ are not. □

Obviously, the number of common intervals in signed permutations can be considerably
smaller than the number of common intervals in unsigned permutations. Hence, applying
Algorithm 2 followed by a filtering step will not yield our desired time-optimal
result.

However, the problem can be solved easily by applying the algorithm for multichro- 
mosomal permutations from the previous section. Since a common interval in signed 
permutations can never contain two genes with different sign, we break the signed per- 
mutations into pieces (“chromosomes”) wherever the sign changes. This is clearly pos- 
sible in linear time. Then we apply the algorithm from the previous section to the ob- 
tained family of multichromosomal permutations, the result being exactly the common 
intervals of the original signed permutations. Hence, we have the following 

Theorem 2. Given $k$ signed permutations of $N = \{1, \ldots, n\}$, all $K$ common
intervals can be found in optimal $O(kn + K)$ time using $O(n)$ additional space. □
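
The reduction from signed to multichromosomal permutations amounts to one linear scan per permutation. The sketch below shows the cutting step and, for checking Example 3 only, combines it with a brute-force enumeration of chromosome intervals instead of the time-optimal algorithm. All function names are hypothetical, and signed genes are assumed to be encoded as non-zero integers.

```python
def split_at_sign_changes(signed_perm):
    """Cut a signed permutation into maximal runs of equal sign.

    Returns a list of "chromosomes" holding the unsigned gene numbers.
    """
    pieces = []
    prev_sign = None
    for gene in signed_perm:
        sign = gene > 0
        if sign != prev_sign:      # sign change (or first gene): new piece
            pieces.append([])
            prev_sign = sign
        pieces[-1].append(abs(gene))
    return pieces


def chromosome_intervals(genome):
    """All gene sets of size >= 2 that are contiguous on some chromosome."""
    found = set()
    for chrom in genome:
        for x in range(len(chrom)):
            for y in range(x + 2, len(chrom) + 1):
                found.add(frozenset(chrom[x:y]))
    return found


def common_intervals_signed(signed_perms):
    """Brute-force common intervals of signed permutations."""
    genomes = [split_at_sign_changes(p) for p in signed_perms]
    return set.intersection(*(chromosome_intervals(g) for g in genomes))


# The three signed permutations of Example 3.
pi1 = [+1, +2, +3, +4, +5, +6]
pi2 = [-3, -1, -2, +5, +4, +6]
pi3 = [-4, +5, +6, -2, -3, -1]
print(split_at_sign_changes(pi3))
print(common_intervals_signed([pi1, pi2, pi3]))
```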

5 Common Intervals of Circular Permutations 

As discussed in the Introduction, much of the DNA in nature is circular. Consequently, 
by representing genomes as (possibly multichromosomal) linear permutations of genes 
and then looking for common gene clusters, one might miss clusters that span across 
the (mostly arbitrary) dissection point where the circular genome is linearized. 

In this section we consider an arrangement of the set of genes
$N = \{1, 2, \ldots, n\}$ along a circle and call this a circular permutation. Given a
family $\Pi = (\pi_1, \ldots, \pi_k)$ of $k$ circular permutations of $N$, a (sub)set
$c \subseteq N$ of genes is called a common interval if and only if the elements of $c$
occur uninterruptedly in each circular permutation.

Example 4. Let $N = \{1, \ldots, 6\}$ and $\Pi = (\pi_1, \pi_2, \pi_3)$ with
$\pi_1 = (1, 2, 3, 4, 5, 6)$, $\pi_2 = (2, 4, 5, 6, 1, 3)$, and
$\pi_3 = (6, 4, 1, 3, 2, 5)$. Apart from the trivial intervals ($N$, the singletons, and
$N$ minus each singleton), the common intervals of $\Pi$ are $\{1, 2, 3\}$,
$\{1, 2, 3, 4\}$, $\{1, 4, 5, 6\}$, $\{2, 3\}$, $\{4, 5, 6\}$, $\{5, 6\}$. □

In the following we will show how to find all $K$ common intervals in a family of
circular permutations in optimal $O(kn + K)$ time. Again, this can be done by an easy
modification of the original algorithm from Section 3 in combination with the following
observation.

Lemma 3. Let $c$ be a common interval of a family $\Pi$ of circular permutations of
$N$. Then its complement $\bar{c} := N \setminus c$ is also a common interval of $\Pi$.

Proof. This follows immediately from the definition of common intervals of circular
permutations. □
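
Both Example 4 and the complement property of Lemma 3 can be checked by brute force. The following sketch enumerates every arc of each circular permutation and intersects the results. It is meant as an illustration of the definition only, the function names are hypothetical, and the running time is cubic in $n$ per permutation.

```python
def circular_intervals(perm):
    """All non-empty gene sets occupying consecutive positions on the circle."""
    n = len(perm)
    found = set()
    for start in range(n):
        for size in range(1, n + 1):
            found.add(frozenset(perm[(start + i) % n] for i in range(size)))
    return found


def common_intervals_circular(perms):
    """Gene sets that occur uninterruptedly in every circular permutation."""
    return set.intersection(*(circular_intervals(p) for p in perms))


# The three circular permutations of Example 4.
perms = [[1, 2, 3, 4, 5, 6], [2, 4, 5, 6, 1, 3], [6, 4, 1, 3, 2, 5]]
common = common_intervals_circular(perms)
n = 6
# Drop the trivial intervals: N, the singletons, and N minus a singleton.
nontrivial = {c for c in common if 2 <= len(c) <= n - 2}
print(sorted(sorted(c) for c in nontrivial))
```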
Note that Lemma 3 does not hold for irreducible intervals.

The general idea is now to first find only the common intervals of size
$\le \lfloor n/2 \rfloor$, and then find the remaining common intervals by
complementing these. The procedure is outlined in Algorithm 4. The main difference to
Algorithm 2 is that function $\varphi_i$ is replaced by a variant, denoted
$\varphi^*_i$, that works on circular permutations and only generates irreducible
intervals of size $\le \lfloor n/2 \rfloor$. This function is implemented by multiple
calls to the original function $\varphi_i$. The two circular permutations $\pi_1$ and
$\pi_i$ are therefore linearized in two different ways each, namely by once cutting
them between positions $n$ and 1, and once cutting between positions
$\lfloor n/2 \rfloor$ and $\lfloor n/2 \rfloor + 1$. Then $\varphi_i$ is applied to each
of the four resulting pairs of linearized permutations. For convenience, the output of
common intervals of length $> \lfloor n/2 \rfloor$ is suppressed. Finally, the resulting
intervals of the four runs of $\varphi_i$ are merged, sorted according to their start
and end positions using counting sort, and duplicates are removed. Clearly,
$\varphi^*_i$ generates all irreducible intervals of size $\le \lfloor n/2 \rfloor$ in
$O(n)$ time. Hence, we have the following

Algorithm 4 (Finding All Common Intervals of $k$ Circular Permutations)

Input: A family $\Pi = (\pi_1 = id_n, \pi_2, \ldots, \pi_k)$ of $k$ circular permutations of $N = \{1, \ldots, n\}$.
Output: The set of all common intervals $C_\Pi$.
1: $I_{\Pi_1} \leftarrow (\{1, 2\}, \{2, 3\}, \ldots, \{n-1, n\}, \{n, 1\})$
2: for $i = 2, \ldots, k$ do
3:     $I_{\Pi_i} \leftarrow \{\varphi^*_i(c) \mid c \in I_{\Pi_{i-1}}\}$ // (see text)
4: end for
5: generate $C_\Pi$ from $I_\Pi = I_{\Pi_k}$ using Lemma 1
6: $\bar{C}_\Pi \leftarrow \{\bar{c} \mid c \in C_\Pi\}$
7: output $C_\Pi \cup \bar{C}_\Pi$

Theorem 3. Given $k$ circular permutations of $N = \{1, \ldots, n\}$, all $K$ common
intervals can be found in optimal $O(kn + K)$ time using $O(n)$ additional space. □



6 Combination of the Algorithms 

In this section we show how to handle arbitrary combinations of multichromosomal, 
signed, and circular permutations. 

Combining multichromosomal and signed permutations is straightforward, but it is not
obvious how to handle combinations which involve circular chromosomes without losing
the optimal running time. Circular chromosomes of different genomes might now have
incompatible gene contents and Lemma 3 no longer holds, as the following example
shows.

Example 5. Let $N = \{1, \ldots, 8\}$ and $\Pi = (\pi_1, \pi_2)$ with
$\pi_1 = \{(1, 2, 3, 4), (5, 6, 7, 8)\}$ and $\pi_2 = \{(1, 3, 5, 6, 7), (2, 4, 8)\}$
where all chromosomes are circular. While $c = \{5, 6\}$ is a common interval, its
complement $N \setminus c = \{1, 2, 3, 4, 7, 8\}$ is not. □

We overcome these problems by a preprocessing step where we include artificial
breakpoints into the genomes. The breakpoints do not affect common intervals but refine
the permutations so that they can be handled by our algorithms. The preprocessing is
performed as follows.

We compare permutation $\pi_1$ successively to each of the other permutations $\pi_i$,
$2 \le i \le k$, and test for each pair of neighboring genes in $\pi_1$ (i.e., for each
chromosome $c = (\pi_1(l), \pi_1(l+1), \ldots, \pi_1(r))$ the pairs
$\{\pi_1(j), \pi_1(j+1)\}$ for $l \le j \le r - 1$, plus the pair
$\{\pi_1(l), \pi_1(r)\}$ for circular $c$) if they lie on the same chromosome in $\pi_i$
or not. If not, they can not be elements of the same common interval and we introduce a
new artificial breakpoint between the two genes in $\pi_1$. Then we do the same
comparison in the opposite direction, i.e., we introduce breakpoints between
neighboring genes of $\pi_i$, $2 \le i \le k$, whenever they do not lie on the same
chromosome of $\pi_1$. The first time a breakpoint is inserted in a circular
chromosome, the chromosome is linearized by cutting at the breakpoint and replacing it
in the genome by the appropriately circularly shifted linear chromosome. Breakpoints in
a linear chromosome dissect the chromosome. This preprocessing can be performed in
$O(kn)$ time.
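
The breakpoint test of the preprocessing step can be sketched as follows. The code only detects where breakpoints would be introduced when one genome is compared to another; it does not perform the subsequent cutting and linearization. The function names are hypothetical, and as a simplification a single flag marks all chromosomes of the first genome as circular or linear.

```python
def neighbor_pairs(chrom, circular):
    """Adjacent gene pairs of one chromosome, plus the closing pair if circular."""
    pairs = list(zip(chrom, chrom[1:]))
    if circular and len(chrom) > 2:
        pairs.append((chrom[-1], chrom[0]))
    return pairs


def breakpoints(genome_a, genome_b, circular=True):
    """Neighboring pairs of genome_a split across chromosomes of genome_b."""
    # Map every gene to the index of its chromosome in genome_b.
    home = {gene: idx for idx, chrom in enumerate(genome_b) for gene in chrom}
    return [(g, h)
            for chrom in genome_a
            for (g, h) in neighbor_pairs(chrom, circular)
            if home[g] != home[h]]


# The two genomes of Example 5 (all chromosomes circular).
pi1 = [[1, 2, 3, 4], [5, 6, 7, 8]]
pi2 = [[1, 3, 5, 6, 7], [2, 4, 8]]
print(breakpoints(pi1, pi2))   # breakpoints to insert into pi1
print(breakpoints(pi2, pi1))   # breakpoints to insert into pi2
```

Building the gene-to-chromosome map and scanning all neighboring pairs takes linear time per comparison, which is consistent with the $O(kn)$ bound stated for the whole preprocessing.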

After the preprocessing, the genes that do not occur in any circular chromosome can be
handled by the algorithm for multichromosomal permutations (Section 3) in a
straightforward way. The genes that occur in at least one circular chromosome are
partitioned into sets of genes which correspond to a single circular chromosome. This
partition is well defined, since the set of genes of each remaining circular chromosome
corresponds, in the other genomes, either to one circular or to one or several linear
chromosomes. Each element of this partition is now treated separately. We start by
restricting all genomes to the selected gene set. If each of the restricted genomes is
circular we can apply the algorithm for circular permutations (Section 5) directly.
Otherwise we choose a restricted genome that consists of one or several linear
chromosomes and arrange the chromosomes in an arbitrary order. Denote by $l$ ($r$) the
first (last) gene in this order. We proceed as in the multichromosomal case (Section 3)
except when we encounter a circular genome $\pi_c$. If $l$ and $r$ are neighboring genes
in $\pi_c$ we linearize $\pi_c$ by cutting between them and proceed as for a linear
genome. Otherwise, similar to the case of circular permutations (Section 5), we copy
$\pi_c$ four times and linearize the copies by cutting one copy on the left of $l$, one
copy on the right of $l$, one copy on the left of $r$, and one copy on the right of $r$.
For each of these genomes we compute all irreducible intervals. The resulting intervals
are merged, sorted according to their start and end positions using counting sort, and
duplicates are removed. This procedure guarantees that we determine all irreducible
intervals except for those which contain $l$ and $r$ simultaneously. But due to our
choice of $l$ and $r$ there is at most one such interval, the trivial one, which
contains all genes. We test this interval separately.

Since the above described preprocessing and the modifications of the algorithms for
multichromosomal and circular permutations do not affect the optimal asymptotic running
time, we have

Theorem 4. Given $k$ multichromosomal, signed, circular or linear (or mixed)
permutations of $N = \{1, \ldots, n\}$, all $K$ common intervals can be found in optimal
$O(kn + K)$ time using $O(n)$ additional space. □

7 Conclusion 

In this paper we have presented time and space optimal algorithms for variants of the 
common intervals problem for k permutations. The variants we considered, multichro- 
mosomal permutations, signed permutations, circular permutations, and their combina- 
tions, were motivated by the requirements imposed by real data we were confronted 
with in our experiments. While in preliminary testing we have applied our algorithms 
to bacterial genomes, it is obvious that in a realistic setting, one should further relax 
the problem definition. In particular, one should allow for missing or additional genes 
in a common interval while imposing a penalty whenever this occurs. Such relaxations 
seem to make the problem much harder, though. 

Abstract. An algorithmic aid is presented which identifies those amino 
acids in a peptide that are essential to bind against a specific monoclonal 
antibody. The input data comes from random peptide array screening 
experiments, which result in a binding strength for each of these 
peptides. On the basis of this data, the proposed algorithm generates a 
rule, which describes the amino acids that are necessary to ensure strong 
binding and consequently separates the best binding peptides from the 
worst binding ones. The generation of this rule is performed using only 
information about the occurrence of an amino acid in a peptide, i.e., 
it doesn’t use the relative position of the amino acids in the peptide. 
Results obtained from experimental data show that for several different 
monoclonal antibodies, the amino acids which are important according to 
the generated rules, coincide with amino acids included in motifs which 
are known to be important for binding. The information gained from 
this algorithm is useful for the design of subsequent experiments aimed 
at further optimization of the best binding peptides found during the 
peptide screening experiment. 



1 Introduction 

Synthetic peptides can be designed to bind against the same antibody as complete
proteins do. For this, the peptide should mimic the area of the protein which is
recognized by the antibody. Although proteins normally consist of a large number of
amino acids, the contact area between an antibody and a protein generally consists of
15-22 amino acids on each side. Several experiments suggest that only three to five of
these amino acids are responsible for the major binding contribution between a protein
and an antibody. These amino acids define the so called ‘energetic epitope’. In this
paper, we focus on the determination of those amino acids that constitute this
‘energetic epitope’.

The most obvious way to construct a synthetic peptide is by determining the 
amino acids comprising the epitope of the protein. However, these amino acids 
can be hard to identify and if they are identified, the usage of these amino acids 
will not always yield a well binding peptide. This can for example be due to a 
conformational difference between the amino acids at the protein surface and the 
amino acids in the peptide. When no well binding peptide can be constructed 
from a part of the original protein, a random search is generally performed to 
find a well binding peptide. Such a search can e.g. be done by a phage display 
analysis [1,7,10] or a peptide screening analysis 0. In a peptide screening 
analysis, thousands of randomly generated peptides can be measured against the same 
antibody. This results in a relative binding strength for every measured peptide. 
In general, the best binding peptide found during such an analysis is, however, 
far worse than required. Therefore, several lead variants are usually screened to 
improve the performance of the best binding peptides. 

Although thousands of different measurements are performed during a single 
run of a peptide screening analysis, the number of tested peptides is relatively 
small compared to the number of possible different peptides. The peptides under 
study consist of 12 successive amino acids, implying that one can generate 20^12 
unique peptides. In principle, one should test all these possibilities to find the 
best binding peptide. Naturally, this is not feasible and hence a more efficient 
search procedure is required. 
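
To make the size of this search space concrete, the following minimal sketch computes the number of possible 12-residue peptides and the fraction covered by one screening run. The run sizes 3640 and 4550 are the ones reported in Section 2; the function name is only illustrative.

```python
# Size of the search space for peptides of 12 residues over 20 amino acids.
ALPHABET_SIZE = 20
PEPTIDE_LENGTH = 12

n_possible = ALPHABET_SIZE ** PEPTIDE_LENGTH  # 20^12 distinct peptides


def fraction_covered(n_tested: int) -> float:
    """Fraction of all possible peptides covered by n_tested distinct peptides."""
    return n_tested / n_possible


for n_tested in (3640, 4550):
    print(f"{n_tested} peptides cover {fraction_covered(n_tested):.2e} "
          f"of {n_possible:.2e} possibilities")
```

Even the larger run covers roughly one peptide in 10^12 of the space, which is why a guided search is needed.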

One logical method to search for such a well binding peptide is to select 
several of the best known binding peptides and test a large number of mutations 
of these peptides. But, again due to the large search space, only a limited number 
of these mutations can be tested during one experiment. Still the question re- 
mains which (combination of) amino acids of these peptides should be replaced. 
In this paper we propose an algorithm that is meant to help in this decision by 
identifying those amino acids which have a higher chance of being important for 
binding. Effectively this reduces the search space and makes it possible to spend 
more effort in mutating the other amino acids of these peptides. 

The method generates a rule containing a combination of amino acids which 
is over-represented in the best binding and underrepresented in the worst binding 
peptides. Because the amino acids in the rule are over-represented amongst the 
best binding peptides, these amino acids will most probably improve the binding 
strength and are most probably contained in the ‘energetic epitope’. 

2 Data 

The data used in this paper is obtained from four distinct peptide screening 
experiments, where one experiment has been performed for every monoclonal 
antibody described. In each experiment the binding strength of either 3640 or 
4550 randomly generated peptides is measured against an antibody. Each of 
these peptides consists of 12 amino acids, where each amino acid has an equal 
probability of occurring at a given position in the peptide. The resulting values 
represent the binding strength between the different peptides and the monoclonal 
antibody at a fixed dilution. 

The peptides are sorted by binding strength so that peptides with a 
similar binding strength are placed close together. Several (scaled) binding 
curves are shown in Section 4. 



3 Method 



The actual binding process between a peptide and an antibody is very complex 
and depends on the interplay between several amino acids. In this paper, we do 
not consider the full complexity of the problem, rather, our approach is based on 
the principle that some amino acids are typically needed in a peptide to ensure 
a high affinity binding against an antibody. 

To find these amino acids, we need a representation that does not include the 
positional information of the peptide’s amino acids. To this end, each peptide 
is represented by a 20 dimensional feature vector. Each feature represents the 
occurrence of a specific amino acid in the peptide (only the 20 naturally occur- 
ring L-amino acids are used to generate the peptides). Thus, if an amino acid 
is included in the peptide (even if it occurs more than once) the corresponding 
feature is set to 1, otherwise to 0. An example of this representation is given in 
Table 1. Our aim is to design an algorithm that constructs, for a given 



Table 1. Splitting a Peptide up in 20 Features. 

                      Feature 
     Peptide          A  C  D  E  F  G  H  I  K  L 
  1  QGWFFMQINTQY     0  0  0  0  1  1  0  1  0  0 
  2  LMWNPNIKTCER     0  1  0  1  0  0  0  1  1  1 
  3  CDSADVSGDHLI     1  1  1  0  0  1  1  1  0  1 
  4  MHGVIAQGQQDV     1  0  1  0  0  1  1  1  0  0 
  5  FHNQLYYSPDYV     0  0  1  0  1  0  1  0  0  1 
  6  HEEWFQLFYYMQ     0  0  0  1  1  0  1  0  0  1 





antibody, based on the measured data, a logical rule which describes the amino 
acids that are needed in a well binding peptide. This rule maps each of the 
measured peptides to one of two classes: 1) ‘well binding’ peptides that con- 
tain the necessary amino acids and thus satisfy the rule and 2) ‘bad binding’ 
peptides which do not contain the required amino acids and do not satisfy the 
rule. Since the resulting rule will be used as a decision support in the analysis 
of the measurements, it is imperative that the generated rule is interpretable. 
Therefore, we chose to construct rules using logical expressions of amino acids 
(allowing only AND and OR constructions). An example is given in Equation 1: 

IF (X OR X OR ...) AND (X OR X OR ...) AND ...     (1) 
THEN well binding ELSE bad binding 

Each X in Equation 1 can be replaced by any of the 20 different amino acids. 
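
The occurrence representation and the rule form of Equation 1 can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the names `encode`, `satisfies` and `example_rule` are assumptions, and a rule is modelled here as a list of OR terms (sets of amino acids) that are combined with AND.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 naturally occurring amino acids


def encode(peptide: str) -> list:
    """20 binary features: 1 if the amino acid occurs at least once, else 0."""
    present = set(peptide)
    return [1 if aa in present else 0 for aa in AMINO_ACIDS]


def satisfies(peptide: str, rule: list) -> bool:
    """True ('well binding') if every OR term shares an amino acid with the peptide."""
    present = set(peptide)
    return all(term & present for term in rule)


# Hypothetical rule used only for illustration: (G OR V) AND F AND T.
example_rule = [{"G", "V"}, {"F"}, {"T"}]

print(encode("QGWFFMQINTQY")[:10])               # features A..L, as in Table 1
print(satisfies("VSFMFLAPTGFT", example_rule))   # contains V, G, F and T
print(satisfies("LIHYDNTNMWYT", example_rule))   # lacks F, G and V
```

Because only occurrence is stored, two peptides with the same amino acids in a different order receive the same feature vector and therefore the same classification.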

A classification result of a possible rule is shown in Figure 1, where 
for every peptide a dot is drawn, indicating whether the peptide is classified as 
well binding (a dot at 100) or as bad binding (a dot at zero). On the left side 
of the figure, the number of well binding peptides is larger than the number 
of bad binding ones. Clearly, we now have a situation where the rule associates 




Fig. 1. Classification of Several Peptides (horizontal axis: peptide no.). A dot 
at 100 means that the corresponding peptide is classified as well binding. The 
continuous line shows the estimated well binding distribution. 

every peptide with a binary value (well binding/bad binding) while the data 
set associates every peptide with a continuous valued binding strength. Since 
the binding strength gradually transitions from the largest to the smallest value 
it is undesirable to define a crisp boundary between the two categories. 

Therefore, to evaluate the rule, we need an additional requirement. This re- 
quirement is built into the algorithm as follows: It is assumed that the measured 
binding strength of a given peptide is proportional to the similarity between 
this peptide and the best n binding peptides. According to this assumption, we 
may interpret the measured binding strength as a measure of the probability 
density of the well binding peptides (over the sorted peptides). By requiring 
that the distribution of the well binding peptides equals the binding energy 
curve, the classification rule can be optimized. Clearly this will have the desired 
result that peptides having a high binding energy will be more often classified 
as well binding and vice versa. 

That leaves us with the question how to find the distribution of the well 
binding peptides given the (binary) output of the classification rule, as in 
Figure 1. Here we convolve the output of the classification rule with a sliding 
window that calculates the percentage of well binding peptides in that window 
(resembling a Parzen estimation). Figure 1 shows such a resulting 
distribution when using an averaging window of 25 peptides wide. At the bor- 
der we clip the averaging window, such that the filtered line has the same size 
as the number of measured peptides. The average value for the first peptide 
is calculated using only half of the window size (13 peptides). The next one is 
calculated using 14 peptides, etc. Clearly, the number of peptides used in this 
figure is smaller than the number of peptides used in the experiment. On the 
real data sets a window size of 100 was employed. 
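
The clipped averaging window described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name is hypothetical, the window is centred on each peptide, and an odd width is assumed so that a width of 25 yields 13 peptides for the first position and 14 for the second, as in the text.

```python
def windowed_fraction(labels: list, width: int = 25) -> list:
    """Percentage of well binding peptides in a centred window, clipped at the borders.

    labels holds 1 (well binding) or 0 (bad binding) for the sorted peptides.
    """
    half = width // 2
    n = len(labels)
    result = []
    for i in range(n):
        lo = max(0, i - half)          # clip the window at the left border
        hi = min(n, i + half + 1)      # clip the window at the right border
        window = labels[lo:hi]
        result.append(100.0 * sum(window) / len(window))
    return result


# Toy input: the 13 best binding peptides satisfy the rule, the rest do not.
toy_labels = [1] * 13 + [0] * 87
curve = windowed_fraction(toy_labels, width=25)
print(curve[0], curve[12], curve[99])
```

The output has the same length as the input, so it can be compared point by point with the (equally filtered) binding curve.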

The estimated distribution can now be compared with the measured binding 
energy. Figure 2 shows an example of such a comparison. In this figure, the 




Fig. 2. Comparison between the filtered measured binding energy and the es- 
timated well binding distribution. The line separating Area 1 and Area 2 is 
automatically placed at the ‘knee’ of the measured binding curve. The position 
of the knee is derived from the intersection point of two straight lines fitted to 
the binding curve. 



measured binding energy curve is rescaled to fit between 0 and 100. Additionally, 
we filtered the binding energy in the same way as the classified peptides were 
filtered. The latter was done to improve the comparison between the two curves 
by diminishing the effect of the different preprocessing steps of the data (e.g. the 
value of the best binding peptides is considerably reduced by the averaging) . 

The classification rule is now optimized by minimizing the difference between 
the binding energy curve and the well binding distribution. This is done by 
minimizing the root mean square error between the two curves (i.e., the dashed 
area in Figure 2). 

Inspection of the binding energy curve reveals that this curve has a sharp 
peak on the extreme left side. In other words the well binding peptides are 
underrepresented in the data set in comparison to the bad binding peptides. 
This asymmetric distribution has a severe consequence in the calculation of the 
root mean square error difference between the two curves: i.e., errors made on the 
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bad binding peptides accumulate to constitute a large percentage of the overall 
error. This may have the undesirable effect that after minimization of the error, 
the binding curve is well approximated in the area where bad binding peptides 
occur, while the peak, i.e., the section containing the best binding peptides (in 
which we are interested) is poorly approximated. 

To overcome this undesired behavior, the calculation of the root mean square 
error (RMS) between the measured binding energy and the estimated ‘well 
binding’ distribution is split in two terms: 1) one for the peptides above the 
background binding strength (Region 1 in Figure 2) and 2) one for the 
other elements (Region 2 in Figure 2). The final error measure is the 
weighted sum of the RMS error in each of these areas, where each of the terms 
is weighted by the number of samples involved: 

Cost = \frac{RMS_1}{n_1} + \frac{RMS_2}{n_2}     (2) 

where RMS_1 and RMS_2 are the RMS errors in the two regions, and n_1 and n_2 
represent the number of samples associated with the two terms. 
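
The two-region error measure can be sketched as below. The exact form of Equation 2 is an assumption here: each region's RMS error is taken to be divided by that region's sample count, so that the small region holding the best binding peptides is not swamped by the large background region. The names `rms`, `cost` and `knee` are illustrative, and `knee` is the index of the first peptide in Region 2.

```python
import math


def rms(a: list, b: list) -> float:
    """Root mean square difference between two equally long curves."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def cost(measured: list, estimated: list, knee: int) -> float:
    """Assumed form of the cost: RMS_1 / n_1 + RMS_2 / n_2."""
    n1, n2 = knee, len(measured) - knee
    if n1 <= 0 or n2 <= 0:
        raise ValueError("knee must split the curve into two non-empty regions")
    return (rms(measured[:knee], estimated[:knee]) / n1
            + rms(measured[knee:], estimated[knee:]) / n2)


# Toy curves: the estimate is off by 10 in Region 1 and exact in Region 2.
measured = [100.0, 80.0, 10.0, 10.0, 10.0, 10.0]
estimated = [90.0, 70.0, 10.0, 10.0, 10.0, 10.0]
print(cost(measured, estimated, knee=2))
```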

The optimization algorithms applied try to minimize this measure by generating 
a rule of the form given in Equation 1, i.e., a rule consisting of the logical 
AND composition of a set of terms. Each term is defined here as the logical OR 
composition of a set of amino acids. In this paper, a genetic algorithm and a 
greedy algorithm are employed to find this rule. 

For the genetic algorithm, the rule is coded in a chromosome which contains 
20 elements for each OR term (each element codes for the usage of one of the 
20 amino acids in the OR term). This is necessary to code for every possible 
combination of amino acids in one OR term. A result of this algorithm is shown 
in Section 4. 
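
One way to read this chromosome layout is as a bit string of 20 elements per OR term. The sketch below is an assumption about the encoding, not the authors' code; the function names are hypothetical, and how an all-zero block (an OR term with no amino acids) is treated is left open.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def encode_rule(rule: list) -> list:
    """Flatten a rule (list of OR terms) into 20 bits per term."""
    return [1 if aa in term else 0 for term in rule for aa in AMINO_ACIDS]


def decode_chromosome(chromosome: list, n_terms: int) -> list:
    """Split a chromosome of 20 * n_terms bits back into OR terms."""
    if len(chromosome) != 20 * n_terms:
        raise ValueError("chromosome length must be 20 * n_terms")
    rule = []
    for t in range(n_terms):
        bits = chromosome[20 * t: 20 * (t + 1)]
        rule.append({aa for aa, bit in zip(AMINO_ACIDS, bits) if bit})
    return rule


chromosome = encode_rule([{"G", "V"}, {"F"}, {"T"}])
print(len(chromosome), decode_chromosome(chromosome, n_terms=3))
```

With this layout the number of OR terms fixes the chromosome length, which is consistent with having to choose the number of terms in advance.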

Most results presented in this paper are generated by a greedy algorithm. 
This algorithm starts with an empty rule. During the first step the algorithm 
selects that amino acid which results in the lowest cost. This results in a rule 
with a single term consisting of only one amino acid. In subsequent steps, the 
rule is modified by either 1) addition of a new amino acid to one of the existing 
terms, 2) deletion of one amino acid from one of the terms, 3) addition of a new 
(single amino acid) term or 4) deletion of a term consisting of a single amino 
acid. A greedy choice is made between all these possibilities. 

At every iteration step, the algorithm is forced to add one amino acid to the 
rule or remove one from it, even when the resulting rule has a higher cost than 
the current one. If the current solution is already the optimal one, the algorithm is forced 
to generate a rule with a lower performance. The optimal solution, however, 
will be found back in the next iteration step. The advantage of this enforced 
addition/removal of one amino acid at every iteration step is that it gives the 
algorithm the possibility to escape from a small local optimum. 

Eventually, when the algorithm is unable to find a better solution, it will 
iterate between the best rule and the second best rule found. If this happens, 
the algorithm terminates and the best rule is returned. 
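
The greedy procedure with forced moves can be sketched as follows. This is a minimal illustration under assumptions: a rule is modelled as a set of OR terms, `cost_fn` stands for any cost such as the two-region error measure, ties are broken alphabetically, and the search stops when the best move leads straight back to the previous rule (the oscillation described above). All names are hypothetical.

```python
from typing import Callable, FrozenSet, List, Optional, Set, Tuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
Term = FrozenSet[str]
Rule = FrozenSet[Term]


def canonical(rule: Rule) -> List[str]:
    """Deterministic text form of a rule, used only to break cost ties."""
    return sorted("".join(sorted(term)) for term in rule)


def neighbours(rule: Rule) -> Set[Rule]:
    """All rules that differ from `rule` by exactly one added or removed amino acid."""
    out: Set[Rule] = set()
    for term in rule:
        rest = rule - {term}
        for aa in AMINO_ACIDS:
            if aa not in term:                 # 1) add an amino acid to a term
                out.add(rest | {term | {aa}})
        if len(term) > 1:
            for aa in term:                    # 2) delete an amino acid from a term
                out.add(rest | {term - {aa}})
        else:                                  # 4) delete a single-amino-acid term
            out.add(rest)
    for aa in AMINO_ACIDS:                     # 3) add a new single-amino-acid term
        out.add(rule | {frozenset(aa)})
    out.discard(rule)
    return out


def greedy_search(cost_fn: Callable[[Rule], float],
                  max_steps: int = 200) -> Tuple[Rule, float]:
    current: Rule = frozenset()                # start with an empty rule
    best, best_cost = current, cost_fn(current)
    previous: Optional[Rule] = None
    for _ in range(max_steps):
        # Forced move: take the cheapest neighbour even if it is worse.
        nxt = min(neighbours(current), key=lambda r: (cost_fn(r), canonical(r)))
        if nxt == previous:                    # oscillating between two rules
            break
        previous, current = current, nxt
        if cost_fn(current) < best_cost:
            best, best_cost = current, cost_fn(current)
    return best, best_cost


# Toy cost for illustration: number of terms that differ from a target rule.
target: Rule = frozenset({frozenset("F"), frozenset("T")})
best_rule, best_rule_cost = greedy_search(lambda r: len(r ^ target))
print(canonical(best_rule), best_rule_cost)
```

The toy cost is only there to make the sketch self-contained; with real data `cost_fn` would classify all peptides with the rule, filter the result and compare it with the binding curve.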
4 Results 

The proposed algorithm is applied to four different data sets. The results are 
discussed in this section. Except when stated otherwise, the distribution of well 
binding peptides has been derived with an averaging window of width 100. The 
percentage shown on the vertical axes of the figures represents the value of the 
distribution and can be interpreted as the percentage of peptides in the vicinity 
of a given peptide which are classified as well binding. 



mAb A. Figure 3 shows the result for monoclonal antibody A. The dashed 




Fig. 3. Results of Monoclonal Antibody A. The dashed line is the rescaled ver- 
sion of the measured relative binding strength. The solid line gives the estimated 
distribution of well binding peptides when applying the rule in Equation 3. 



line in the figure is the rescaled measured binding curve (the values are rescaled 
to fit in the same window as the results). The solid line is the distribution of 
well binding peptides when using the rule in Equation 3. 

Well binding => (G OR V) AND F AND T     (3) 

The figure shows that both curves resemble each other quite well, which gives 
a good indication that it is possible to classify the data in this way. On the 
left side of the figure, the estimated distribution starts at approximately 70% 
implying that the algorithm could not identify a pattern in some of the best 
binding peptides. 

For this antibody it is known that the motif FT improves the relative binding 
strength considerably. This kind of information is not known for the amino acids 
G or V, but a large number of the well binding peptides containing F and T also 
contain a G or a V. Including these elements in the rule makes it possible to reduce 
the number of false positives. This can be understood as follows: increasing 
the size of the rule without increasing the number of misclassified well binding 
peptides reduces the number of false positives. The larger the number of amino 
acids in a peptide that need to fulfill the rule, the smaller the number of peptides 
that will be classified as well binding. 

The genetic algorithm is also applied to this data. In our implementation 
of this algorithm, the number of OR terms has to be predefined. 
The best results found using two, three and four terms are shown in Equations 
4-6. 

Well binding => (G OR H OR V) AND F     (4) 
Well binding => (Q OR S OR V) AND F AND T     (5) 
Well binding => (C OR G OR V) AND F AND T AND (I OR L OR S OR V)     (6) 



The result of has a lower cost than the one proposed by the greedy algorithm. 
But it does not deviate from the greedy result in stating that F and T are the 
two most important amino acids. So, the information gained from the greedy 
algorithm is comparable to the result of the genetic algorithm. 

The greedy algorithm and the genetic algorithm also performed comparably 
on the other data sets. We prefer the greedy algorithm because it is 
much faster than the genetic algorithm. Therefore, only results obtained with 
the greedy algorithm will be presented in the rest of this paper. 

Using the rule generated by the greedy algorithm (Equation 3), peptide no. 3 in 
Table 2 is classified as a bad binding peptide. This is partly due to the lack 



Table 2. The Five Best Binding Peptides of Monoclonal Antibody A. 

no.  Peptide        Result 
1    CMEQIMRGTPFT   Well binding 
2    KLSTPIRHGAFT   Well binding 
3    LIHYDNTNMWYT   Bad binding 
4    VSFMFLAPTGFT   Well binding 
5    MWPLPWVAIPFT   Well binding 



of the amino acid F in the peptide. In this case the quite similar amino acid 
Y probably makes the same contribution to the binding as F normally does. 
This substitution effect is obviously not included in the algorithm. 

Equation 3 is created using all the available peptides. To check the stability 
of the algorithm, the results are also calculated after repeatedly removing 10% 
of the measured peptides from the training set. The resulting 10 rules are shown 
in Equation 7: 



 1: Well binding => (G OR V) AND F AND T 
 2: Well binding => (G OR V) AND F AND T 
 3: Well binding => (G OR V) AND F AND T 
 4: Well binding => (G OR K) AND F AND T 
 5: Well binding => (G OR V) AND F AND T 
 6: Well binding => (G OR M) AND F AND T     (7) 
 7: Well binding => (G OR V) AND F AND T 
 8: Well binding => (G OR V) AND F AND T 
 9: Well binding => (G OR V) AND F AND T 
10: Well binding => (G OR V) AND F AND T AND (A OR D OR K OR R OR S) 



The term (A OR D OR K OR R OR S) in the last rule is true for 97% of the 
peptides. This term is probably due to an error made in an early greedy step in 
the algorithm. When an OR group doesn’t improve the result of the algorithm 
and it has two or more elements, the algorithm can only increase the number of 
amino acids in the group to reduce its effect. The V seems to be less important 
because it is replaced once by a K and once by an M. In all ten rules, the F and 
T have to be included in a peptide to be classified as well binding. 
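
The 97% figure can be checked with a short calculation. The sketch assumes, as stated in Section 2, that each of the 20 amino acids is equally likely at each of the 12 positions, and additionally that positions are independent; the function name is illustrative.

```python
def p_term_satisfied(k: int, length: int = 12, alphabet: int = 20) -> float:
    """Probability that a uniformly random peptide contains at least one
    of k given amino acids, i.e. that it satisfies an OR term of size k."""
    return 1.0 - (1.0 - k / alphabet) ** length


# Five-element term such as (A OR D OR K OR R OR S): 1 - 0.75^12
print(round(p_term_satisfied(5), 3))

# Single-amino-acid term: 1 - 0.95^12
print(round(p_term_satisfied(1), 3))
```

A five-element OR term is satisfied by about 97% of random peptides, which matches the observation above, whereas a single amino acid is present in a little under half of them. Long OR terms therefore barely constrain the rule.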

mAb 32F81. The number of well binding peptides for monoclonal antibody 
32F81 is much larger than for mAb A, which can be seen in Figure 4. Again the 




Fig. 4. Results of Monoclonal Antibody 32F81. The dashed line is the rescaled 
version of the measured relative binding strength. The solid line gives the 
percentage of elements which fulfill Equation 8. 



algorithm is able to find a rule which approximates the binding strength curve 
quite well. The rule found for this antibody (Equation 8) describes the data 
better in the sense that 80% of the best binding peptides are classified as well 
binding. The right part of the predicted result is remarkable (after peptide 
4000). Here the distribution of well binding peptides rises while the measured 
binding energy shows a rapid decay. Some of the peptides which occur in this 
area show an exceptionally low binding strength in several different peptide 
screening experiments. The reason for this exceptionally bad binding behavior 
of these peptides is not found by the algorithm. 

The five best binding peptides of monoclonal antibody 32F81 are shown in 
Table 3. 



Table 3. The Five Best Binding Peptides of Monoclonal Antibody 32F81. 

no.  Peptide        Result 
1    FKCTADQYVTFA   Well binding 
2    QGSWDKFGMNEH   Well binding 
3    LPWHEILKQNVN   Well binding 
4    DKIENYIQWPHD   Well binding 
5    EKGDHQCAQGVY   Well binding 



Well binding => (E OR T) AND K AND (D OR F OR H)     (8) 

All peptides shown fulfill the rule and are classified as well binding peptides. 
For this antibody it should be noted that the result using only the amino acid K is 
already quite discriminative: 96% of the 400 best binding peptides contain this 
amino acid. The comparison between the distribution found by the algorithm 
and the distribution using a rule containing only K is shown in Figure 5. 




Fig. 5. Comparison of the resulting rule of the algorithm (Equation 8) and the rule: K, for 
monoclonal antibody 32F81. 



The result using the rule ‘K’ has almost a 100% accuracy for the best binding 
peptides. However the number of false positives is also quite high. The cost 
function we employ prefers the answer with a lower number of false positives. 
mAb B. A part of the distribution of monoclonal antibody B is shown in 
Figure 6. The number of well binding peptides is extremely small for this experiment. 




Fig. 6. Part of the Result for Monoclonal Antibody B. The dashed line is the 
rescaled version of the measured relative binding strength. The solid line gives 
the percentage of elements which fulfill the rule in Equation 9 (calculated using 
an averaging window of 16). 



This makes it necessary to reduce the size of the averaging window when 
estimating the distribution. If a window size of 100 were employed 
here, the maximal value of the estimated distribution would be 15%, effectively 
removing the peak of the distribution. To avoid this, the averaging window has 
been reduced to a length of 16. A disadvantage of such a short window is the 
larger fluctuations of the predicted result (as can be seen in the tail of the curve 
in Figure 6). 

The rule used to generate Figure 6 is given in Equation 9. A known 
well binding motif for this antibody is TEDSAVE. The amino acids A, V and E of 
this motif are included in the rule found by the algorithm. 

Well binding => A AND (E OR N) AND (I OR V)     (9) 

The five best binding peptides for this antibody are given in Table 4. The 



Table 4. The Five Best Binding Peptides of Monoclonal Antibody B. 

no.  Peptide        Result 
1    DAAEDDVNGRFM   Well binding 
2    VTNCCRSAVHEK   Well binding 
3    RNMSKTSASAVE   Well binding 
4    WELWIKKKAKVV   Well binding 
5    EAIWKKCKGVTF   Well binding 



AVE motif is present in the third peptide of this table. 



Determination of Binding Amino Acids 275 



mAb 6A.A6. The last result is for monoclonal antibody 6A.A6 (Figure 7). 
The distribution of well binding peptides is estimated quite well for the 




Fig. 7. Results for Monoclonal Antibody 6A.A6. The dashed line is the rescaled 
version of the measured relative binding strength. The solid line gives the 
percentage of elements which fulfill the rule in Equation 10. 



best binding peptides. However, for the bad binding peptides, the distribution 
does not fit the actual measured binding energy well, resulting in a large number 
of false positives. An interesting element of Equation 10 is the OR part (D OR E). 
These two amino acids are very similar (see e.g. 0) and can often be exchanged. 

Well binding => S AND (D OR E)     (10) 

The best known binding peptide against this antibody has the following se- 
quence: SRLPPNSDVVLG. The amino acids S and D are included in the peptide, 
thus the best known peptide will be classified as a well binding peptide. Re- 
placement studies of this peptide show that the motif PNSD is very important 
for a well binding result 0. This pattern is much longer than the result found 
by the algorithm. This might be caused by the lack of very well binding 
peptides that have this motif in the data set. It should be noted that the binding 
strength of even the best measured peptides of the random data set is much 
smaller than this best known result. The five best binding peptides are shown 
in Table 5. The known important motif PNSD does not occur among these 
peptides. 

5 Discussion 

We have introduced an algorithm that identifies those amino acids in a peptide 
which are most probably needed to bind against a monoclonal antibody. The 
relevant amino acids are described by a rule that distinguishes between well and 
bad binding peptides. The algorithm we have introduced automatically generates 
such a rule description from peptide screening experimental data. 
Table 5. The Five Best Binding Peptides of Monoclonal Antibody 6A.A6. 

no.  Peptide        Result 
1    RSDSMGILLLPL   Well binding 
2    PQSDTLHYCIQN   Well binding 
3    CSDVVETRCQID   Well binding 
4    ESNTWVKCYSYW   Well binding 
5    EFMSDIALRGTV   Well binding 



This is based on the fact that if a certain motif or combination of amino acids 
improves the binding strength between a peptide and a monoclonal antibody, 
most of the peptides containing this motif will bind better than average against 
this antibody. This results in an over-representation of these amino acids in the 
better binding peptides. Such an over-representation of amino acids is utilized 
by the proposed algorithm to generate a rule of amino acids that distinguishes 
between well binding and bad binding peptides. 

A greedy optimization algorithm is employed, which means that it does not 
necessarily find the rule with the minimal cost. Other optimization procedures 
like genetic algorithms are able to perform better in finding the absolute opti- 
mum. 

Solutions better than those obtained with the greedy approach were, in our 
experience, only marginally better in terms of the cost function, and were 
characterized by more OR terms (like the term (A OR D OR K OR 
R OR S) in the last rule in Equation 7). 

It is important, however, to keep in mind that a large number of peptides 
will fulfill these long OR terms, i.e., such rules are non-specific. Since these rules 
are employed to generate new peptides for subsequent analysis, we are more 
interested in rules with short OR terms which reduce the search space by their 
specificity. 

We could adapt the cost function to achieve this by adding a penalty term 
for long rules. However, if we look more carefully at the greedy optimization, 
we notice that it starts by including the most discriminating amino acids in the 
rule. This results in a reasonable performance after only a few iteration steps. 
Additional OR terms are only included when this improves the performance of 
the rule (in fact, the performance should improve when normally one, but at most 
two, additional amino acids are included in the OR term). This greedy property of the 
algorithm reduces the chance that these unimportant long terms are included in 
the final rule. 

The rules as such cannot be directly employed to generate new well binding 
motifs, since positional information and frequency of occurrence are lost in the 
feature representation. However, within the group of well binding peptides that 
satisfy the rule, the amino acids specified in the rule occur at specific positions 
in these peptides. A logical approach is to construct new candidate peptides by 
preserving for the N best binding peptides, those amino acids that appear in the 
rule, while randomly mutating the rest. 
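
The candidate construction suggested above can be sketched as follows. This is only an illustration of the idea: the function name, the `rate` parameter (the chance that a non-rule position is mutated) and the uniform choice of replacement are assumptions, not part of the described method.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def mutate_outside_rule(peptide: str, rule_amino_acids: set,
                        rate: float = 0.5, rng: random.Random = None) -> str:
    """Keep every residue that appears in the rule; mutate the others with
    probability `rate` to a different, uniformly chosen amino acid."""
    rng = rng or random.Random()
    out = []
    for aa in peptide:
        if aa in rule_amino_acids or rng.random() >= rate:
            out.append(aa)                       # preserved position
        else:
            out.append(rng.choice([x for x in AMINO_ACIDS if x != aa]))
    return "".join(out)


# Example with the amino acids of the rule (G OR V) AND F AND T.
parent = "VSFMFLAPTGFT"
keep = set("GVFT")
candidate = mutate_outside_rule(parent, keep, rate=1.0, rng=random.Random(0))
print(parent, candidate)
```

With `rate=1.0` every position outside the rule is changed; smaller values give candidates that stay closer to the parent peptide.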
Abstract. A central issue in molecular biology is understanding the regulatory 
mechanisms that control gene expression. The recent flood of genomic and post- 
genomic data opens the way for computational methods elucidating the key com- 
ponents that play a role in these mechanisms. One important consequence is 
the ability to recognize groups of genes that are co-expressed using microarray 
expression data. We then wish to identify in-silico putative transcription factor 
binding sites in the promoter regions of these genes, that might explain the 
co-regulation, and hint at possible regulators. In this paper we describe a simple 
and fast, yet powerful, two-stage approach to this task. Using a rigorous 
hypergeometric statistical analysis and a straightforward computational procedure we 
find small conserved sequence kernels. These are then stochastically expanded 
into PSSMs using an EM-like procedure. We demonstrate the utility and speed of 
our methods by applying them to several data sets from recent literature. We also 
compare these results with those of MEME when run on the same sets. 



1 Introduction 

A central issue in molecular biology is understanding the regulatory mechanisms that 
control gene expression. The recent flood of genomic and post-genomic data, such as 
microarray expression measurements, opens the way for computational methods eluci- 
dating the key components that play a role in these mechanisms. 

Much of the specificity in transcription regulation is achieved by transcription fac- 
tors, which are largely responsible for the so called combinatorial aspects of the regu- 
latory process (the number of possible behaviors being much larger than the number of 
factors). These are proteins that, when in the suitable state, can bind to specific DNA 
sequences. By binding to the chromosome in a location near the gene, these factors can 
either activate or repress the transcription of the gene. While there are many potential 
sites where these factors can bind, it is clear that much of the regulation occurs by fac- 
tors that bind in the promoter region which is located upstream of the transcription start 
site. 

Unlike DNA-DNA hybridization, the dynamics of protein-DNA recognition are not 
completely understood. Nonetheless, experimental results show that transcription 
factors have a specific preference for particular DNA sequences. Somewhat generalizing, the 
affinity of most factors is determined to a large extent by one or more relatively short 
regions of 6-10bp. (One must bear in mind that DNA strands span a complete turn 
every 10 bases, thus geometric considerations make it unlikely that a single protein 
binds to a longer region, although counterexamples are known.) A common situation is 
the formation of dimers in which two DNA binding proteins form a complex. Each of 
the two proteins binds to a short sequence, and together they bind to a sequence that 
can be 12-18bp long, with a short spacer separating the two regions. Common protein 
motifs such as the DNA binding Helix-Turn-Helix (HTH) motif also induce the same 
preference on the regulatory site. 

The recent advances in microarray experiments allow to monitor the expression 
levels of genes in a genome-wide manner 181911411.512212.^1. An important aspect of 
these experiments is that they allow to find groups of genes that have similar expres- 
sion patterns across a wide range of conditions mi- Arguably, the simplest biological 
explanation of co-expression is co-regulation by the same transcription factors.Q 

This observation sparked several works on in-silico identification of putative tran- 
scription factor binding sites 11411711912(112111 . The general scheme that most of these pa- 
pers take involves two phases. First, they perform, or assume, some clustering of genes 
based on gene expression measurements. Second, they search for short DNA patterns 
that appear in the promoter region of the genes in each particular cluster. These works 
are based to a large extent on methods that were developed to find common motifs in 
protein and DNA sequences. These include combinatorial methods l lhll 91211241251 . pa- 
rameter optimization methods such as Expectation Maximization (EM) [[O, and Markov 
Chain Monte Carlo (MCMC) simulations 111 812(1 . See 1191 for a review of these lines 
of work. 

The use of expression profiles helps to select relatively “clean” clusters of genes 
(i.e., most of them are indeed co-regulated by the same factors). Our interest here lies 
with the second phase, and is thus not limited to gene expression analysis. Given high 
quality clusters of genes, suspected for any reason to be co-regulated, we address the 
hardness of the computational problem of finding putative binding sites in these clusters. 

In this paper we describe a fast, simple, yet powerful, approach for finding putative 
binding sites with respect to a given cluster of genes. Like some of the other works we 
divide this phase into two stages. In the first stage we scan, in an exhaustive manner, 
for simple patterns from an enumerable class (such as all 7-mers). We use a straight- 
forward, natural, and well understood statistical model for filtering significant patterns 
out of this class. Using the hyper-geometric distribution, we compute the probability
that a subset of genes of the given size will have this many occurrences of the pattern
we examine, when chosen randomly from the group of all known genes. In the
second stage, we use the patterns that were chosen as seeds for training a more expressive
position-specific scoring matrix (PSSM) to model the putative binding site.
These models are both a more accurate representation of the binding site and can
potentially capture much longer conserved regions.

By assuming that most binding sites do contain highly conserved short subsequences
and by explicitly using our post-genomic knowledge of all known and putative genes
to contrast clusters of genes against the genome background, we acquire quality seeds
for the construction of PSSMs through a simplified hyper-geometric model. The seeds
allow us to track down potential binding site locations through a specific relatively
conserved region within them. We then use these short seeds to guide the construction of
potentially much longer PSSMs encompassing more, or possibly the complete binding
site. In particular, they allow us to align multiple sequences without resorting to an
expensive search procedure (such as MCMC simulations).

¹ Clearly this is not always the case. Co-regulation can be achieved by other means, and similar
expression patterns can be a result of parallel pathways or a close serial relationship.
Nonetheless, this is often the case, and a reasonable hypothesis to test.

Indeed, an important feature of our approach is the evaluation speed. Once we finish 
a preprocessing stage, we can evaluate clusters very efficiently. The preprocessing is 
genome-wide and not cluster specific. It can be done only once and stored for all future 
reference. This is important both for facilitating interactive analysis, and for serving 
as computationally-cheap quality starting points for other, more complex analysis tools 
(such as S2I) on top of our method. 

In the next three sections we outline our algorithmic approach, discussing the significance
of events, seed finding, and seed expansion into PSSMs, respectively. In Section 5
we describe experimental and comparative results, and then conclude with a discussion.

2 Scoring Events for Significance 

2.1 Preliminaries 

Suppose we are given a set of genes Q. Ideally, these are all the known and putative 
genes in a genome. With each gene $g \in Q$ we associate a promoter sequence² $s_g$. For
simplicity we assume that each of these sequences is of the same size, L. 

Suppose we are now given a subset of genes $G \subset Q$ suspected to be co-regulated by
some transcription factor (for example, based on clustering of genes by their expression
patterns). Our aim is to find patterns in the promoter region of these genes that we
will consider as putative binding sites. The assumption is that the co-regulation is
mediated by factors whose binding sites are present in most of the genes in group G, but
are rare overall in Q. Thus, a pattern is considered significant if it is characteristic
of G compared to the background Q.

Before we discuss what constitutes a pattern in our context, we address the basic 
statistical definition of a characteristic property. Suppose we find a pattern that appears 
in the promoter sequences of several genes in G. How do we measure the significance 
of these appearances with respect to Q? A related question one may ask is whether the
set G is significantly different, in terms of the composition of its upstream region, from Q.

For now, we concentrate on events occurring in the promoter region of a gene. We 
focus on binary events, such as “$s_g$ contains the subsequence ACGTTCG or its reverse
complement”. Alternatively, one can consider counting the number of occurrences of an
event in each promoter sequence, e.g., “the number of times the subsequence ACGTTCG
appears in $s_g$”. The analysis of such counting events, while attractive in our biological
context, is more complex, in particular since multiple occurrences of an event in a se- 
quence are not independent of each other. See for approximate solutions to this 

problem. 

² Or an upstream region that best approximates it, when the transcription start site is unknown.

Formally, a binary event $E$ is defined by a characteristic function
$I_E : \{A, C, G, T\}^* \to \{0, 1\}$ that determines whether that event occurred or not in any
given nucleotide sequence. Given a set $G$, we define $\#_E(G) = \sum_{g \in G} I_E(s_g)$ to be the
number of times $E$ occurs in the promoter regions of group $G$. We want to assess the
significance of observing $E$ at least $\#_E(G)$ times in $G$, when taking the set of genes Q
as the background for our decision.

There are two general approaches for testing such significance. In both cases we 
compute p-values: the probability, under the null hypothesis, of observations at least as
extreme as ours. This value serves as a measure of the significance of the pattern: the
lower the p-value is, the more plausible it is that an observation is significant, rather
than a chance artifact.
The two approaches differ, however, in the nature of each null-hypothesis. 

2.2 Random Sequence Null Hypothesis 

In this approach, the null hypothesis assumes that the sequences $s_g$ for $g \in G$ are
generated from a background sequence model $P_0(s)$. This background distribution attempts
to model “prototypical” promoter regions, but does not include any group-specific mo- 
tifs. Thus, if the event E detects such special motifs, then the probability of randomly 
sampling genes that satisfy E is small. 

The background sequence model can be, for example, a Markov process of some 
order (say 2 or 3) estimated from the sequences in Q (or, preferably, from $Q - G$). Using
this background model we need to compute the probability $p_E = P_0(I_E(s) = 1)$ that a
random sequence of length $L$ will match the event of interest. Now, if we also assume
under the null hypothesis that the $n$ sequences in $G$ are independent of each other, then
the number of matches to $E$ in $G$ is distributed $\mathrm{Bin}(n, p_E)$. We can then compute
the p-value of finding $\#_E(G)$ or more such random sequences by the tail weight of a
Binomial distribution.
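
As a rough illustration of this last step, the Binomial tail can be evaluated directly once the match probability is available. The sketch below assumes that probability has already been obtained by some other means; the function name `binomial_tail` and the numbers in the usage lines are illustrative and do not come from the paper.

```python
from math import comb


def binomial_tail(k, n, p):
    """Return P(X >= k) for X ~ Bin(n, p).

    k: observed number of sequences matching the event
    n: number of sequences in the group
    p: probability that one random background sequence matches
    """
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))


# Hypothetical usage: 12 of 40 promoters match an event whose
# per-sequence match probability under the background model is 0.05.
p_value = binomial_tail(12, 40, 0.05)
```

For very large groups a log-space evaluation would be more robust numerically; the direct sum is kept here for readability.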

The key technical issue in this approach is computing $p_E$. This, of course, depends
on the assumed form of the background distribution, and on the complexity of the
event. However, even for the simple definition of a pattern as an exact subsequence
(i.e., $I_E(s) = 1$ iff $s$ contains a specific subsequence) and background probability of
the form of an order 1 Markov chain, the required computation is not trivial. This forces
the development of various approximations to $p_E$ of varying accuracy and complexity

EEED- 

2.3 Random Selection Null Hypothesis 

Alternatively, in the approach we focus on here, one does not make any assumption 
about the distribution of promoter sequences. Instead, the null hypothesis is that G was 
selected at random from Q, in a manner that is independent of the contents of the genes’
promoter regions. 

Assume that $K = \#_E(Q)$ out of $N = |Q|$ genes satisfy $E$. Thus, we require the
number³ of genes that satisfy $E$ in Q. The probability of an observation under the null
hypothesis is the probability of randomly choosing $n = |G|$ genes in such a way that



³ But not the identity, simplifying the implied underlying in-vitro measurements.



282 Yoseph Barash, Gill Bejerano, and Nir Friedman 



$k = \#_E(G)$ of them include the event $E$. This is simply the hyper-geometric probability
of finding $k$ red balls among $n$ draws without replacement from an urn containing
$K$ red balls and $N - K$ black ones:

$$P_{\mathrm{hyper}}(k \mid n, K, N) = \frac{\binom{K}{k}\binom{N-K}{n-k}}{\binom{N}{n}}$$

The p-value of the observation is the probability of drawing $k$ or more genes that satisfy
$E$ in $n$ draws. This requires summing the tail of the hyper-geometric distribution:

$$\text{p-value}(E, G) = \sum_{k'=k}^{n} P_{\mathrm{hyper}}(k' \mid n, K, N)$$

The main appeal of this approach lies in its simplicity, both computationally and sta- 
tistically. This null hypothesis is particularly attractive in the post-genomic era, where 
nearly all promoter sequences are known. Under this assumption, irrelevant clustering 
selects genes in a manner that is independent of their promoter region. 
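
The hyper-geometric tail described in this section is short enough to write out in full. The following is a minimal sketch using exact integer binomial coefficients; the function names `hyper_pmf` and `hyper_pvalue` and the numbers in the usage lines are our own illustrative choices, not taken from the paper.

```python
from math import comb


def hyper_pmf(k, n, K, N):
    """P(exactly k of n genes drawn without replacement satisfy E),
    when K of the N background genes satisfy E."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)


def hyper_pvalue(k, n, K, N):
    """P(k or more of the n drawn genes satisfy E): the tail sum.

    Terms with k' > K are zero, so the sum stops at min(n, K).
    """
    return sum(hyper_pmf(j, n, K, N) for j in range(k, min(n, K) + 1))


# Hypothetical usage: a cluster of 50 genes out of 6000, where the
# pattern occurs in 100 promoters genome-wide and in 10 within the cluster.
p_value = hyper_pvalue(10, 50, 100, 6000)
```

Python integers are arbitrary precision, so the binomial coefficients are exact and only the final division is carried out in floating point.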

2.4 Dealing with Multiple Hypotheses 

We have just defined the significance of a single event E with respect to a group of 
genes G. But if we keep trying different events $E_1, \ldots, E_M$ over the same group of
genes, we will eventually stumble upon a surprising event even in a group
of randomly selected sequences, chosen under the null hypothesis.

Judging the significance of findings in such repeated experiments is known as multiple
hypotheses testing. More formally, in this situation we have computed a set of
p-values $p_1, \ldots, p_M$, the smallest corresponding to the most surprising event. We now ask
how significant our findings are, considering that we have performed $M$ experiments.

One approach is to find a value $q = q(M)$, such that the probability that any of the
events (or the smallest one) has a p-value less than $q$ is small. Using the union bound
under the null hypothesis we get that

$$P(\min_m p_m < q) \le \sum_m P(p_m < q) = M \cdot q$$

Thus, if we want to ensure that this probability of a false recognition is less than 0.01
(i.e., 99% confidence), we need to set the Bonferroni threshold $q = 0.01/M$ (see, for
example, (nil).

The Bonferroni threshold is strict, as it ensures that each and every validated scoring
event is not an artifact. Our aim, however, is a bit different. We want to retrieve a set of 
events, such that most of them are not artifacts. We are often willing to tolerate a certain 
fraction of artifacts among the events we return. A statistical method that addresses this 
kind of requirement is the False Discovery Rate (FDR) method of [ 01 . Roughly put, the 
intuition here is as follows. Under the null hypothesis, there is some probability that the 
best scoring event will have a small p-value. However, if the group was chosen by the 
null hypothesis, it can be shown that the p- values we compute are distributed uniformly. 




Simple Binding-Site Discovery Algorithm 283 



Thus, the p-value of the second best event is expected to be roughly twice as large as 
the p-value of the best event. Given this intuition, we should be less strict in rejecting 
the null hypothesis for the second best pattern and so on. 

To carry out this idea, we sort the events by their observed p-values, so that
$p_1 \le p_2 \le \cdots \le p_M$. We then return the events $E_1, \ldots, E_k$ where $k \le M$ is the
maximal index such that $p_k \le \frac{kq}{M}$, and $q$ is the significance level we want to
achieve in selecting. We have replaced a strict validation test of single events with a
more tolerant version validating a group of events. We may now detect significant patterns,
weaker than the most prominent one, that previously fell short of the threshold computed
for the latter.
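
To make the difference between the two selection rules concrete, here is a small sketch of both. It assumes the step-up form of the FDR rule in which the k-th smallest p-value is compared with k·q/M; the function names and the example p-values are invented for illustration.

```python
def bonferroni_select(pvalues, level=0.01):
    """Indices of events whose p-value is below level / M."""
    q = level / len(pvalues)
    return [i for i, p in enumerate(pvalues) if p < q]


def fdr_select(pvalues, q=0.01):
    """Indices of the k most significant events, where k is the largest
    rank whose sorted p-value satisfies p_k <= k * q / M."""
    M = len(pvalues)
    order = sorted(range(M), key=lambda i: pvalues[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank * q / M:
            k_max = rank
    return order[:k_max]


# Hypothetical p-values for four events, tested at level 0.05.
example = [0.001, 0.012, 0.025, 0.5]
strict = bonferroni_select(example, level=0.05)
tolerant = fdr_select(example, q=0.05)
```

On this toy input the third event fails the Bonferroni cutoff of 0.0125 but passes the FDR rule, because at rank 3 it is compared against 0.0375.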



3 Finding Promising Seeds 

3.1 Simple Events 

We want to consider patterns over relatively short subsequences. We fix a parameter $\ell$
that determines the length of the sequences we are interested in. Events are then defined
over the space of $4^\ell$ $\ell$-mers.

Arguably the simplest $\ell$-mer pattern is a specific subsequence (or consensus). Thus,
if $\sigma$ is an $\ell$-mer it defines the event “$\sigma$ is a subsequence of $s$”. A useful aspect of such
events is that they are exhaustively enumerable for the range of $\ell$ we are interested in.
This suggests examining all $\ell$-mer patterns in $G$ and ranking them according to their
significance.

However, known binding sites that are identified by biological assays display variability
in the binding sequence. Thus, we do not expect to see only exact matches to the
$\ell$-mer consensus. Instead, we want to allow approximate matches when we search $G$.
To formalize, consider a distance measure between two $\ell$-mers, $d(\sigma, \sigma')$. The simplest
such function is the Hamming distance. However, we may consider more realistic functions,
such as distances that penalize changes in a position specific manner. (Biology
suggests, for example, that central positions in short binding sites are more conserved.)
For concreteness, we focus on the Hamming distance measure in the remainder of the
paper. However, we stress that the following discussion applies directly to any chosen
distance measure.

Let $\sigma$ be an $\ell$-mer. We define a $\delta$-ball centered around $\sigma$ to be the set $\mathrm{Ball}_\delta(\sigma)$ of
$\ell$-mers that are of distance at most $\delta$ from $\sigma$. Thus, in the Hamming distance example,
$\mathrm{Ball}_1(\mathrm{AAA})$ = {AAA, CAA, GAA, TAA, ACA, AGA, ATA, AAC, AAG, AAT}. We match an
event $E$ with $\mathrm{Ball}_\delta(\sigma)$ such that $I_E(s) = 1$ iff $s$ or its reverse complement contains
an $\ell$-mer $\sigma' \in \mathrm{Ball}_\delta(\sigma)$.
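
The ball construction and the associated binary event can be sketched in a few lines for the Hamming distance case. The helper names `hamming_ball`, `reverse_complement` and `event_occurs` are hypothetical, and the sketch assumes sequences contain only the letters A, C, G and T.

```python
from itertools import combinations, product

ALPHABET = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(s):
    """Reverse complement of a DNA string over ACGT."""
    return s.translate(COMPLEMENT)[::-1]


def hamming_ball(center, delta):
    """All strings within Hamming distance delta of center."""
    ball = {center}
    for d in range(1, delta + 1):
        for positions in combinations(range(len(center)), d):
            for letters in product(ALPHABET, repeat=d):
                chars = list(center)
                for pos, letter in zip(positions, letters):
                    chars[pos] = letter
                ball.add("".join(chars))
    return ball


def event_occurs(ball, ell, s):
    """Return 1 if s or its reverse complement contains an ell-mer
    from ball, and 0 otherwise."""
    for strand in (s, reverse_complement(s)):
        for i in range(len(strand) - ell + 1):
            if strand[i:i + ell] in ball:
                return 1
    return 0


ball_aaa = hamming_ball("AAA", 1)
```

Substituting a letter by itself is allowed in the inner loop, which is why the set collects every string at distance at most delta rather than exactly delta.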

Given $\ell$ and $\delta$ we wish to examine all balls that have at least one occurrence in Q
(the rest will never appear in any sub group). Balls that occur in all genes in Q are also
discarded (as they occur in all genes of any sub group). We denote this set of non-trivial
events with respect to Q as $B_Q(\ell, \delta)$. Note that for $\delta > 0$, it may include balls whose
centers do not appear in any promoter region.

Finding the set $B_Q(\ell, \delta)$ of balls, and annotating for each gene whether it matches each
ball can be done in a straightforward manner. The time requirement then is • L • 4 
and the space requirement $N \cdot |B_Q(\ell, \delta)|$.

This genome-wide preprocessing needs to be done only once. Storing its results, we
can rapidly compute p-values of all events with respect to any proposed subset of
genes. We simply look up which events occurred in the genes in the cluster, and then
compute the hyper-geometric tail distribution. Furthermore, one may wish to increase,
shrink, or shift the regions under consideration (e.g., from 1000bp to 2000bp upstream),
or adjust the upstream regions of several genes (say, due to elucidation of the exact
transcription start site). While in general the preprocessing phase must be repeated, in
practice, since it is mainly made up of counting events, we may efficiently subtract and
add, respectively, the counts in the symmetric difference between the old and new sets
of strings, avoiding repeating the complete process. With many completely
sequenced genomes and gene expression data of model organisms in various settings
just beginning to accumulate, our division of labour is especially useful.
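
The division of labour between a one-off genome-wide pass and cheap per-cluster queries might look as follows. This is only a sketch for the exact-match case (no mismatches); the names `build_index` and `score_cluster`, the dictionary layout, and the toy promoters are assumptions made for illustration.

```python
from math import comb

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def build_index(promoters, ell):
    """Genome-wide pass, done once: map each ell-mer to the set of genes
    whose promoter contains it on either strand.

    promoters: dict mapping gene name -> promoter sequence over ACGT.
    """
    index = {}
    for gene, seq in promoters.items():
        rc = seq.translate(COMPLEMENT)[::-1]
        for strand in (seq, rc):
            for i in range(len(strand) - ell + 1):
                index.setdefault(strand[i:i + ell], set()).add(gene)
    return index


def hyper_pvalue(k, n, K, N):
    """Hyper-geometric tail: P(k or more hits among n draws)."""
    tail = sum(comb(K, j) * comb(N - K, n - j) for j in range(k, min(n, K) + 1))
    return tail / comb(N, n)


def score_cluster(index, cluster, N):
    """Per-cluster query: p-value of every ell-mer seen in the cluster,
    returned as (p_value, ell_mer) pairs sorted by significance."""
    cluster = set(cluster)
    n = len(cluster)
    results = []
    for kmer, genes in index.items():
        k = len(genes & cluster)
        if k:
            results.append((hyper_pvalue(k, n, len(genes), N), kmer))
    return sorted(results)


# Toy data: four "promoters", two of which share the motif ACGCGT.
promoters = {"g1": "AAACGCGTAA", "g2": "TTACGCGTTT",
             "g3": "GGGGGGGGGG", "g4": "CCCCTTTTCC"}
index = build_index(promoters, 6)
ranked = score_cluster(index, ["g1", "g2"], N=len(promoters))
```

Only `score_cluster` depends on the cluster, so the index can be stored and reused for any later group of genes.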

3.2 Reducing the Event Space 

The set $B_Q(\ell, \delta)$, holding all events we wish to examine, may include as many as
$\min(4^\ell, LN)$ balls. We note however, that many of these balls overlap. Thus, if $\sigma$ and
$\sigma'$ are two $\ell$-mers that differ, in the Hamming distance example⁴, in exactly one letter,
then the overlap between $\mathrm{Ball}_\delta(\sigma)$ and $\mathrm{Ball}_\delta(\sigma')$ is clearly substantial. Moreover, if we
notice that most of the “mass” of these balls (in terms of the number of occurrences in
genes in Q) lies in the intersection, we expect that the significance of the events defined
by both of them will be similar, since they will be highly correlated.

A way to decrease the storage requirements, and thus extend the range of manageable
$\ell$s, can be found by a guided choice of a representative subset of $B_Q(\ell, \delta)$ during
preprocessing. Based on the above intuitions we want a covering set of balls with max- 
imal mass, to minimize the size of the subset, and minimal overlap, to diversify the 
events themselves. A heuristic solution can be offered in the form of a greedy algo- 
rithm. Starting from an empty subset we repeatedly choose balls of maximal mass that 
do not violate the minimal overlap demand, until we can no longer continue. We now 
proceed to examine and store the results only for the events corresponding to the chosen 
balls. 
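
One possible reading of this greedy heuristic is sketched below. The text does not pin down how overlap is measured, so the sketch assumes, purely for illustration, that the overlap of a candidate ball with an already chosen ball is the fraction of the candidate's members that they share; `greedy_cover`, `max_overlap` and the toy data are hypothetical.

```python
def greedy_cover(balls, mass, max_overlap=0.0):
    """Pick balls in order of decreasing mass, skipping any ball whose
    overlap with a previously chosen ball exceeds max_overlap.

    balls: dict mapping ball name -> set of member strings
    mass:  dict mapping ball name -> number of genes the ball occurs in
    Overlap is taken (as an assumption) to be the shared fraction of the
    candidate ball's members.
    """
    chosen = []
    for name in sorted(balls, key=lambda b: mass[b], reverse=True):
        members = balls[name]
        if all(len(members & balls[c]) / len(members) <= max_overlap
               for c in chosen):
            chosen.append(name)
    return chosen


# Toy example: ball "b" shares half of its members with the heavier ball "a".
balls = {"a": {"AAA", "AAC"}, "b": {"AAC", "AAG"}, "c": {"TTT"}}
mass = {"a": 10, "b": 8, "c": 5}
no_overlap = greedy_cover(balls, mass, max_overlap=0.0)
half_overlap = greedy_cover(balls, mass, max_overlap=0.5)
```

Because the choice uses only genome-wide masses and never looks at a particular cluster, it can be made during preprocessing.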

We stress that since this sparsification is done during preprocessing, before we ob- 
serve any group G, it should not alter the statistical significance of the results we ob- 
serve when G is later given to us. 

4 Learning Finer Representations 

4.1 Position Specific Scoring Matrices 

Using the methods of the previous section we can collect a set of promising patterns 
that are significant for G. These patterns are based on the notion of a $\delta$-ball. Biological
knowledge about transcription factor binding sites suggests that the definition of a
binding site is in fact more subtle. Some positions are highly conserved, while others
are less so. In the literature, there are two main representations of such sites. The first
is the IUPAC consensus sequence. This approach determines the consensus string of
the binding site using a 15-letter alphabet that describes which subset of {A, C, G, T} is
possible at each position.

⁴ Analogous proximity thresholds can be defined for other distance measures.

A position specific scoring matrix (PSSM) (see, e.g., ItlB) offers a more refined 
representation. A PSSM of length $\ell$ is an object $\mathcal{P} = \{p_1, \ldots, p_\ell\}$, composed of $\ell$
column distributions over the alphabet {A, C, G, T}. The distribution $p_i$ specifies the
probability of seeing each nucleotide at the $i$th position in the pattern.

Once we have a PSSM $\mathcal{P}$, we can score each $\ell$-mer $\sigma$ by computing its combined
probability given $\mathcal{P}$. A more common practice is to compute the log-odds between the
PSSM probability and a background probability of nucleotides. Thus, if $p_0$ is assumed
to be the nucleotide probability in promoter regions, then the score of an $\ell$-mer $\sigma$ is:

$$\mathrm{Score}_{\mathcal{P}}(\sigma) = \sum_{i=1}^{\ell} \log \frac{p_i(\sigma[i])}{p_0(\sigma[i])}$$

If this score is positive, $\sigma$ is more probable according to $\mathcal{P}$ than it is according to the
background probability. In practice we set a threshold $a$ (replacing zero) for detecting
a pattern. Thus, a pair $(\mathcal{P}, a)$ defines an event $E_{(\mathcal{P},a)}$. This event occurs iff the best
matching subsequence of length $\ell$ in $s$, or in its reverse complement $\bar{s}$, has a score higher
than $a$. That is, if

$$\max_i \max\{\mathrm{Score}_{\mathcal{P}}(s[i, \ldots, i+\ell-1]),\ \mathrm{Score}_{\mathcal{P}}(\bar{s}[i, \ldots, i+\ell-1])\} > a$$
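
A log-odds scan of this kind can be sketched as follows. The PSSM is represented here as a list of per-position dictionaries, which is our own choice of data structure; the names `log_odds_score` and `best_score`, the uniform background, and the toy two-column matrix are assumptions for illustration. The sketch also assumes the sequence is at least as long as the PSSM.

```python
from math import log

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def log_odds_score(pssm, background, kmer):
    """Sum over positions of log(p_i(c) / p_0(c)) for the letters of kmer.

    pssm:       list of dicts, one per position, nucleotide -> probability
    background: dict nucleotide -> background probability
    """
    return sum(log(col[c] / background[c]) for col, c in zip(pssm, kmer))


def best_score(pssm, background, s):
    """Best log-odds score over all windows of s and its reverse complement."""
    ell = len(pssm)
    rc = s.translate(COMPLEMENT)[::-1]
    return max(log_odds_score(pssm, background, strand[i:i + ell])
               for strand in (s, rc)
               for i in range(len(strand) - ell + 1))


# Toy two-column PSSM preferring "AC", with a uniform background.
background = {c: 0.25 for c in "ACGT"}
pssm = [{"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
        {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1}]
forward_hit = best_score(pssm, background, "GGACGG")
reverse_hit = best_score(pssm, background, "GGGTGG")  # "AC" is on the other strand
```

The event for a threshold then amounts to checking whether `best_score` exceeds that threshold.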

4.2 Selecting a Threshold 

Before we discuss how to learn the PSSM, we consider choosing a threshold $a$ for a
given PSSM $\mathcal{P}$. It is possible to set $a = 0$, treating the background and the PSSM as
equiprobable. However, since the pattern is a rarer event, we want a stricter threshold.
Another potential approach tries to reduce the probability of false recognition. That is,
to find an $a$ such that the probability that a random background sequence $\sigma$ will score
higher than $a$ is smaller than a prespecified $\epsilon$. Then, if we want to allow on average one
false detection every $k$ genes, we would set $\epsilon = \frac{1}{kL}$. Unfortunately, we are not aware
of an efficient computational procedure to find such thresholds.

Here we suggest a simple alternative. We search for a threshold $a$, such that the
induced detections in the group $G$ will be most significant. Thus, given a group $G$ of
genes, and a PSSM $\mathcal{P}$, we search for

$$\arg\min_{a}\ \text{p-value}(E_{(\mathcal{P},a)}, G)$$

That is, we adjust the threshold $a$ so that the event defined by $(\mathcal{P}, a)$ has the smallest
p-value with respect to $G$. This discriminative choice of a threshold ensures that we
adjust it to take into account the amount of “spurious” matches to the PSSM outside
of $G$. Thus, we strive for a threshold that maximizes the number of matches within $G$
and at the same time minimizes the number of matches outside $G$. The use of p-values
provides a principled way of balancing these two requirements.

We can find this threshold quite efficiently. We compute the best score of the PSSM
over each gene in Q, and sort this list of scores. We then evaluate only thresholds which
are, say, half way between any two adjacent values in our list of sorted scores (each
succeeding threshold admits another gene into the group of supposedly detected events).
Using, for example, radix sort, this procedure takes time $O(NL)$.
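
The threshold scan can be sketched as a single pass over the sorted best scores. The sketch assumes the per-gene best scores have already been computed and uses Python's comparison sort rather than the radix sort mentioned in the text; `best_threshold` and the toy scores are hypothetical.

```python
from math import comb


def hyper_pvalue(k, n, K, N):
    """Hyper-geometric tail: P(k or more hits among n draws)."""
    tail = sum(comb(K, j) * comb(N - K, n - j) for j in range(k, min(n, K) + 1))
    return tail / comb(N, n)


def best_threshold(best_scores, cluster):
    """Scan midpoints between adjacent sorted scores and return the pair
    (p_value, threshold) with the smallest hyper-geometric p-value.

    best_scores: dict gene -> best PSSM score in that gene's promoter,
                 for every gene in the background set
    cluster:     the genes of the group under study
    """
    cluster = set(cluster)
    N, n = len(best_scores), len(cluster)
    ranked = sorted(best_scores.items(), key=lambda kv: kv[1], reverse=True)
    best = (1.0, None)
    k = 0  # cluster genes among the K top-scoring genes
    for K in range(1, N):
        gene, score = ranked[K - 1]
        k += gene in cluster
        next_score = ranked[K][1]
        if next_score == score:
            continue  # no threshold separates two equal scores
        p = hyper_pvalue(k, n, K, N)
        if p < best[0]:
            best = (p, (score + next_score) / 2)
    return best


# Toy example: the two cluster genes have the two highest scores.
scores = {"g1": 9.0, "g2": 8.0, "g3": 3.0, "g4": 2.0, "g5": 1.0}
p_value, threshold = best_threshold(scores, ["g1", "g2"])
```

Here the scan settles on the midpoint between the lowest-scoring cluster gene and the highest-scoring gene outside the cluster.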



4.3 Learning PSSMs 



Learning PSSMs is composed of two tasks: estimating the parameters of the PSSM
given a set of training sequences that are examples of the pattern we want to match, and
finding these sequences. The latter is clearly a harder problem and requires some care.

We start with the first task. Suppose we are given a collection $\sigma_1, \ldots, \sigma_n$ of $\ell$-mers
that correspond to aligned sites. We can easily estimate a PSSM $\mathcal{P}$ that corresponds
to these sequences. For each position $i$, we count the number of occurrences of each
nucleotide in that position. This results in a count $N(i, c) = \sum_j 1\{\sigma_j[i] = c\}$.

Given the counts we estimate the probabilities. To avoid entries with zero probability,
we add a pseudo-count $\gamma$ to each entry. Thus, we assign

$$p_i(c) = \frac{N(i, c) + \gamma}{n + 4\gamma} \qquad (1)$$
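
The counting and smoothing step is straightforward to sketch. The function name `estimate_pssm`, the list-of-dictionaries representation, and the pseudo-count value of 1 used in the example are illustrative assumptions; the text does not state which pseudo-count value was used.

```python
def estimate_pssm(sites, gamma=1.0):
    """Estimate a PSSM from aligned, equal-length sites over ACGT.

    Each column probability is (count + gamma) / (n + 4 * gamma),
    where n is the number of sites and gamma the pseudo-count.
    Returns a list of dicts, one per position.
    """
    n = len(sites)
    ell = len(sites[0])
    pssm = []
    for i in range(ell):
        counts = {c: 0 for c in "ACGT"}
        for site in sites:
            counts[site[i]] += 1
        pssm.append({c: (counts[c] + gamma) / (n + 4 * gamma) for c in "ACGT"})
    return pssm


# Four aligned sites; the middle position is C three times and T once.
pssm = estimate_pssm(["ACG", "ACG", "ATG", "ACG"], gamma=1.0)
```

With the pseudo-count, nucleotides never observed at a position still receive a small non-zero probability, so log-odds scores stay finite.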



The key question is how to select the training sequences and how to align them.
Our approach builds on our ability to find seeds of conserved sequences. Suppose that
we find a significant $\delta$-ball using the methods of the previous section. We can then use
this as a seed for learning a PSSM. The simplest approach takes the $\ell$-mers that match
the ball within the promoter regions of $G$ as the training sequences for the PSSM. The
learned PSSM then quantifies which differences are common among these sequences
and which ones are rare. This gives a more refined view of the pattern that was captured
by the $\delta$-ball.

This simple approach learns an $\ell$-PSSM from the $\delta$-ball events found in the data.
However, using PSSMs we can extend the pattern to a much longer one. We start by
aligning not only the sequences that match the $\delta$-ball, but also their flanking regions.
These are aligned by virtue of the alignment of the core $\ell$-mers. We can then learn a
PSSM over a much wider region (say 20bp). If there are conserved positions outside
the core positions, this approach will find them.⁵

Consider, for example, a HTH DNA binding motif, or a binding factor dimer, where 
each component matches 6-10bp with several unspecific gap positions between the two
specific sites. If we find one of the two sites using the methods of the previous sections, 
then growing a PSSM on the flanking regions allows us to discover the other conserved 
positions. 
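
Collecting the seed occurrences together with their flanks might be sketched as follows. The sketch assumes the seed is roughly centred in the wider window and simply skips occurrences too close to a sequence end; neither choice is specified in the text, and `seed_windows` and the toy promoter are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def seed_windows(promoters, seed_ball, ell, width):
    """Collect width-long windows around every occurrence of a seed ell-mer.

    promoters: iterable of promoter sequences over ACGT
    seed_ball: set of ell-mers that count as a match to the seed
    The seed starts (width - ell) // 2 positions into each window, so all
    returned windows are aligned on the core ell-mer. Occurrences whose
    window would run past either end of the sequence are skipped.
    """
    flank = (width - ell) // 2
    windows = []
    for seq in promoters:
        for strand in (seq, seq.translate(COMPLEMENT)[::-1]):
            for i in range(flank, len(strand) - width + flank + 1):
                if strand[i:i + ell] in seed_ball:
                    windows.append(strand[i - flank:i - flank + width])
    return windows


# Toy example: a 6-mer seed extended to a 10bp window.
windows = seed_windows(["CCCCGATGAGTTTT"], {"GATGAG"}, ell=6, width=10)
```

The aligned windows can then be fed to any column-wise estimator to obtain the wider PSSM. Note that a palindromic seed is found on both strands at the same site, so a practical version would probably want to deduplicate such pairs.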

Once we construct such an initial PSSM, we can improve it using a standard EM-like
iterative procedure. This procedure consists of the following steps. Given a PSSM
$\mathcal{P}_0$, we compute a threshold $a_0$ as described above. We then consider each position in
the training sequences and compute the probability that the pattern appears at that
position. Formally, we compute the likelihood ratio that $(\mathcal{P}_0, a_0)$ assigns to the appearance of
the pattern at $s[i, \ldots, i+\ell-1]$. We then convert this ratio to a probability by computing

$$p_{s,i} = \mathrm{logit}(\mathrm{Score}_{\mathcal{P}_0}(s[i, \ldots, i+\ell-1]) - a_0)$$

where $\mathrm{logit}(x) = 1/(1 + e^{-x})$ is the logistic function. We then re-scale these probabilities
by dividing by a normalization factor $Z_s$ so that the posterior probability of
observing the pattern in $s$ and its reverse complement sums to 1. Once we have computed
these posterior probabilities, we can accumulate expected counts for each PSSM
position $i$ and nucleotide $c$. These represent the expected number of times that the $i$'th
position in the PSSM takes the value $c$, based on the posterior probabilities.

Once we have collected these expected counts, we re-estimate the weights of the PSSM
using Eq. (1) to get a new PSSM. We optimize the threshold of this PSSM, and repeat
the process. Although this process does not guarantee improvement in the p-value of the
learned PSSM, successive iterations often do lead to significant
improvements. Note that our iterations are analogous to EM’s hill-climbing behaviour,
and differ from Gibbs samplers where one performs a stochastic random walk aimed at
a beneficial equilibrium distribution.

⁵ This assumes that there are no variable-length gaps inside the patterns. The structural
constraints on transcription factors suggest that these are not common.

5 Experimental Results

We performed several experiments on data from the yeast genome to evaluate the utility
and limitations of the methods described above. To this end, we focused on several
recent examples from the literature that report binding sites found either using
computational tools or by biological verification. To better calibrate the results, we also applied
MEME m, one of the standard tools in this field, on the same examples.

In this first analysis we chose to use the simple Hamming distance measure and treat
the 1000bp sequence upstream of the ORF starting position as the promoter region. We
note that the latter is a somewhat crude approximation, as this region also contains an
untranslated region of the transcript.

We ran our method in two stages. In the first stage, we searched for patterns of
length 6-8 with $\delta$ ranging between 0 and 2 mismatches, and an allowed ball overlap factor
of 0-1. Generally speaking, in these runs the patterns found with no mismatches or
ball overlaps had better p-values. This happens because we search for relatively short
patterns, allowing for a non-trivial probability of a random match. For this reason we
report below only results with exact matches and no overlap. We believe that higher
values of both parameters will be useful for longer patterns (say of length 12 or 13). In
the second stage we ran the EM-like procedure described above on all the patterns that
received significant scores. We chose to learn PSSMs of width 20 using 15 iterations of
our procedure.

To compare the results of these two stages, we ran MEME (version 3.0.3) in two
configurations. The first restricted MEME to retrieve only short patterns of width 6-8,
corresponding to our $\ell$-mers stage. The second configuration used MEME’s own
defaults for pattern retrieval, resembling our end product PSSMs.

We applied our procedure to several data sets from the recent literature. Selected
results are summarized in Table 1. In this table we rank the top results from the different
runs of each procedure by their p-values (or e-values) reported by the programs after
removing repeated patterns. We report the relative rank of the patterns singled out in
the literature and their significance scores. We discuss these results in order.

Table 1. Selected results on binding site regions of several yeast data sets, comparing
our findings with those of MEME.

Source/                Trans.     Consensus  Seed            PSSM            MEME ≤ 8        MEME ≤ 50
Cluster                Factor                rank  p-value   rank  p-value   rank  e-value   rank  e-value

Spellman et al. [22]
  CLN2                 MBF        ACGCGT     1     4e-26     1     3e-42     1     1e-18     1     7e-31
  SIC1                 SWI5p      CCAGCA     1     1e-07     1     1e-12     1     8e-00     8     5e+02
Tavazoie et al. [23]
  3                    putative   GATGAG     2     9e-07     5     6e-09     4     1e+06     2     1e-14
                       putative   GAAAAatT   3     4e-07     2     1e-11     23    8e+07     3     7e-10
  8                    STRE       aAGGgG     1     6e-07     3     4e-06     20    1e+08     -     -
  14                   putative   TTCGCGT    1     2e-09     2     7e-11     13    1e+07     -     -
                       putative   TGTTTgTT   3     2e-07     -     -         -     -         13    4e+05
  30                   MET31/32p  gCCACAgT   1     2e-11     1     2e-11     2     5e+02     8     1e+03
Iyer et al. [16]
  MBF                  MBF        ACGCGT     1     1e-12     1     3e-18     3     1e+04     19    1e-03
  SBF                  SBF        CGCGAAA    1     1e-32     1     1e-37     2     1e-17     -     -

The first data set is by Spellman et al. [22]. They report several cell-cycle related
clusters of genes. In a recent paper, Sinha and Tompa report results of a systematic
search of these clusters for binding sites represented as IUPAC consensus sequences, using a
random sequence null hypothesis utilizing a Markov chain of order 3. The main technical
developments in their paper are methods for approximating the p-value computation with
respect to such a null hypothesis.

We examined two clusters reported on by Sinha and Tompa. In the first one, CLN2,
our method identifies the pattern ACGCGT and various expansions of it. This pattern was
found using patterns of length 6, 7, and 8 with significant p-values. The PSSMs learned
from these patterns were quite similar, all containing the above motif. Figure 1(a) shows
an example. In the second cluster, SIC1, the signal appears with a marginal p-value
(close to the Bonferroni cutoff) already at $\ell = 6$. The trained PSSM recovers the longer
pattern with a significant p-value. In both cases, the top ranking patterns correspond to
the known binding site.

The second data set is by Tavazoie et al. [23]. That paper also examines cell-cycle
related expression levels that were grouped using k-means clustering. They examined
30 clusters, and applied an MCMC-based procedure for finding PSSM patterns in the
promoter regions of genes in each cluster. We examined the clusters they report as
statistically significant, and were able to reproduce binding sites that are very close to
the PSSMs they report; see Table 1.

Fig. 1. Examples of PSSMs Learned by Our Procedure. (a) CLN2 cluster. (b) SBF cluster.
(c) Gasch et al. Cluster M. (d) Gasch et al. Cluster EJ.

In a recent paper, Iyer et al. [16] identify, using experimental methods, two groups
of genes that are regulated by the MBF/SBF transcription factors. Here, again, we managed
to recover the binding sites they discuss with high confidence. For example, we
show one of our matching PSSMs in Figure 1(b).

Finally, we discuss the recent data set of yeast response to environmental stress by 
Gasch et al. m. We report on two clusters of genes “M”, and “EJ”. In cluster M the 
string CACGTGA is found in several of the highest scoring patterns. However, when we
turned to grow PSSMs out of our seeds, a matrix of a lower ranking seed, GATAAGA,
exceeded the rest, exemplifying that seed ordering is not necessarily maintained when
the patterns are extended. The latter, more prominent PSSM is shown in Figure 1(c). In
cluster EJ a significant short pattern rising above our threshold is not found. However,
when we extended the topmost seed we obtained the PSSM of Figure 1(d), which both
nearly crosses our significance threshold and holds biological appeal, showing two
conserved short regions flanking a less conserved 2-mer.

In general, the scores of the learned PSSMs vary. In some cases, the best seeds 
yield the best scoring PSSMs. More often, the best scoring PSSM corresponds to a seed 
lower in the list (we took into account only seeds whose p-values pass the FDR
decision threshold). In most cases the PSSM learned to recognize regions flanking the
seed sequence. In some cases more conserved regions were discovered. In general our 
approach manages to identify short patterns that are close to the pattern in the data. 
Moreover, using our PSSM learning procedure we are able to expand these into more 
expressive patterns. 

We note that in most analysed cases MEME also identified the shorter patterns.
However, there are two marked differences. First and foremost is run time. Compared
on a 733 MHz Pentium III Linux machine, our seed discovery programs ran between
half a minute and an hour, exhaustively examining all possible patterns, while the
EM-like PSSM growing iterations added a couple of minutes. The shortest MEME run on
the same data sets took about an hour, while longer ones ran for days, even when asked to
return only the top thirty patterns. Second, MEME often gave top scores to spurious
patterns that are clear artifacts of the sequence distributions in the promoter regions (such
as poly A’s). When using MEME one can try to avoid these problems by supplying a
more detailed background model. This has the effect of removing most low complexity
patterns from the top scoring ones. Our program avoids most of these pitfalls by
performing its significance tests with respect to the genome background to begin with.



6 Discussion 



In this paper we examined the problem of finding putative transcription factor binding 
sites with respect to a selected group of genes. We advocate significance calculations 
with respect to the random selection null hypothesis. We claim that this hypothesis 
is both simple and clear and is more suitable for gene expression experiments than 
the random sequence null hypothesis. We then use a simple hyper-geometric test in 
a framework for constructing models of binding sites. This framework starts by sys- 
tematically scanning a family of simple “seed” patterns. These seeds are then used for 
building PSSMs. We describe how to construct statistical tests to select the most surpris- 
ing threshold value for a PSSM and combine this with an EM-like iterative procedure 
to improve it. We thus combine a first phase of kernel identification based on a rigorous 
statistical analysis of word over-representation, with a subsequent phase of optimiza- 
tion, leading to a PSSM, which can be used to scan sequences for new matches of the 
putative regulatory motif. 
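
To make the hyper-geometric test concrete, the following minimal Python sketch computes the tail probability under the random selection null hypothesis: the chance that a randomly selected group of genes contains at least as many pattern-carrying promoters as the observed group. It is an illustration only, not our implementation; the function name `hypergeom_tail`, its argument names, and the numbers in the usage line are hypothetical.

```python
from math import comb


def hypergeom_tail(total_genes, genes_with_pattern, group_size, hits_in_group):
    """Return P(X >= hits_in_group) for X ~ Hypergeometric.

    total_genes        -- number of genes in the genome (N)
    genes_with_pattern -- genes whose promoter contains the pattern (K)
    group_size         -- size of the selected gene group (n)
    hits_in_group      -- genes in the group that contain the pattern (k)
    """
    denominator = comb(total_genes, group_size)
    upper = min(genes_with_pattern, group_size)
    numerator = sum(
        comb(genes_with_pattern, i)
        * comb(total_genes - genes_with_pattern, group_size - i)
        for i in range(hits_in_group, upper + 1)
    )
    return numerator / denominator


# Hypothetical numbers: 6000 genes, 120 carry the pattern,
# and 15 of the 50 genes in the selected cluster carry it.
p_value = hypergeom_tail(6000, 120, 50, 15)
print(p_value)
```

Each gene contributes a binary event (pattern present or absent), which matches the restriction to binary events discussed below. A small p-value would then still have to be corrected for the number of patterns examined.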

We showed that even before performing iterative optimization of the PSSMs, our
method recovers highly selective seed patterns very rapidly. We reconstructed results
from several recent papers that use more elaborate and computationally intensive tools
for finding binding sites, and also presented novel binding sites.

A potential weakness of our model is the fact that we disregard multiple copies of a 
match in the same sequence (the restriction to binary events). Despite the fact that this 
phenomenon is known to happen in eukaryotic genes, we recall that a mathematical 
analysis of counting the number of occurrences in a single string is more elaborate, 
and computationally intensive. This may indeed lead in such cases to under-estimation, 
which is problematic mainly for small clusters of co-regulated genes. The recognition 
of two conserved patterns separated by a relatively long spacer (say of 10bp or more),
resulting from a HTH motif or a dimer complex, can however be attacked by looking 
for proximity relationships between pairs of occurrences of different significant seeds. 

As this field is showing an influx of interest, our work resembles several others in 
different aspects. We highlight only the most relevant ones. 

The hyper-geometric distribution is used in the context of finding binding sites
by Jensen and Knudsen [?] to find short conserved subsequences of length 4-6
bp. They demonstrate the ability to reconstruct sequences, but suffer statistical problems
when they consider longer ℓ-mers, due to the large number of competing hypotheses.

Already in Galas et al. [?], word statistics are used to detect over-represented
motifs, and a definition of a general concept of “word neighborhood” is given, similar
to the ball definition we give here. However, the analysis there is restricted to
over-representations at specific positions with respect to a common point of reference across
all sequences, making it mostly appropriate for prokaryotic transcription or translation
promoter region elucidation.

The general outline of our approach is similar to that of Wolfertstetter et al. [?] and
Vilo et al. [?]. Both search for over-represented words and try to extend them. Vilo
et al. examine ℓ-mers of varying sizes that are identified by building a suffix tree for
the promoter regions. Then, they use a binomial formula for evaluating significance.
For the clustering they constructed, this resulted in a very large pool of sequences (over
1500). They use a multiple alignment-like procedure for combining these ℓ-mers into
longer consensus regions. Thus, to learn longer binding sites with variable position,
they require overlapping subsequences to be present in the data. This is in contrast to
our approach, which uses PSSMs to extend the observed patterns, and so is more robust to
highly variable positions that flank the conserved region.

Van Helden et al. [?] also use a binomial approach. They try to take into consideration
the presence of multiple copies of a motif in the same sequence, but suffer from
resulting inaccuracies with respect to auto-correlating patterns. Our work can be seen as 
generalizing this approach in several respects, including the use of a hyper-geometric 
null model, the discussion of general distance functions and event space coarsening, 
and the iterative PSSM improvement phase. 

There are several directions in which we can extend our approach, some of them 
embedding ideas from previous works into our context. 

First, in order to estimate the sensitivity of our model it will be interesting to examine
it on smaller, and known, gene families, as well as on synthetic data sets, such as those
advocated in [?]. Extending our empirical work beyond yeast should also provide new
insights and challenges.

Our method treats the complete promoter region as a uniform whole. However,
biological evidence suggests that the occurrence of binding sites can depend on the
position within the promoter sequence [?]. We can easily augment our method by defining
events on sub-regions within the promoter sequence. This will facilitate the discovery
of subsequences specific to certain positions. Another biological insight already
mentioned is the phenomenon of two conserved patterns separated by a relatively long spacer.
In the case of homodimers we can easily expand our scope to handle events that require
two appearances of the subsequence within the promoter region. Otherwise, we can try
to extend our PSSMs further to flank the seed while weighting each column so as to
allow for longer spacers between meaningful sub-patterns.

So far we have looked for contiguous conserved patterns within the binding site.
More complex extensions involve defining new distance measures that incorporate
preferences for more conserved positions in specific positions in the pattern, and random
projection techniques, akin to [?], which will allow us to easily handle longer ℓ-mers.
We can also further generalize our model by allowing ourselves to express our ℓ-mer
centroids over the IUPAC alphabet. This allows both for a reduction of the event space
and the natural incorporation of biological insight, as outlined above. Our current
method for diluting the set of “covering” δ-balls is highly heuristic. Interesting
theoretical issues include the formal criteria we should optimize in selecting this
approximating set of δ-balls and how to efficiently optimize with respect to such a criterion.
Finally, we intend to combine the putative sites we discover with learning methods that
learn dependencies between different sites and between sites and other attributes such
as expression levels and functional annotations [?].



Abstract. Using current technology, large consecutive stretches of DNA (such
as whole chromosomes) are usually assembled from short fragments obtained by 
shotgun sequencing, or from fragments and mate-pairs, if a “double-barreled” 
shotgun strategy is employed. The positioning of the fragments (and mate-pairs, 
if available) in an assembled sequence can be used to evaluate the quality of 
the assembly and also to compare two different assemblies of the same chro- 
mosome, even if they are obtained from two different sequencing projects. This 
paper describes some simple and fast methods of this type that were developed to 
evaluate and compare different assemblies of the human genome. Additional ap- 
plications are in “feature-tracking” from one version of an assembly to the next, 
comparisons of different chromosomes within the same genome and comparisons 
between similar chromosomes from different species. 



1 Introduction 

Although current technology for DNA sequencing is highly automated and can
determine large numbers of base pairs very quickly, only about 550 consecutive
base pairs (bp) on average can be reliably determined in a single read [?]. Thus, a large
consecutive stretch of source DNA can only be determined by “assembling” it from short
fragments obtained using a shotgun sequencing strategy [?]. In a modification of this approach
called double-barreled shotgun sequencing [?], larger clones of DNA are sequenced
from both ends, thus producing mate-pairs of sequenced fragments with known relative
orientation and approximate separation (typically, employing a mixture of 2kb, 5kb,
10kb, 50kb and 150kb clones). So, usually a sequencing project produces a collection
of fragments that are randomly sampled from the source sequence. The average number
$x$ of fragments that cover any given position in the source sequence is known as the
fragment x-coverage.

Given two different assemblies of the same chromosome-sized source sequence,
possibly obtained from two different sequencing projects, how can one evaluate and
compare them? The aim of this paper is to present some fast and simple methods
addressing this problem that are based on fragment and mate-pair data obtained in
a sequencing project for the source sequence. Additional applications are in tracking
forward “features” from one version of an assembly to the next, and in the comparison of
different chromosomes from the same genome and of similar chromosomes from different
species. Although each method on its own is just an implementation of a simple idea or
heuristic, our experience is that the integration of these methods gives rise to a powerful
tool. We originally developed this tool to compare different assemblies of the human
genome, see Figures 6 and 7 in [7].

In Section 2 we discuss assembly evaluation and comparison techniques based on
fragments. In particular, we introduce the concept of “segment displacement”, which
measures by how much the positioning of a segment of conserved sequence differs between
two assemblies. Then we present some mate-pair based methods in Section 3, including
a useful breakpoint detection heuristic. Finally, we demonstrate the utility of these
methods in Section 4.

2 Fragment-Based Analysis and Comparison Methods 

Several useful methods for evaluating a single assembly or comparing two assemblies — 
such as sequencing coverage, dot-plots, or line-plots — can be implemented in terms of 
the positions in an assembly to which fragments are assigned. 

For our purposes, a contig is simply a finite string $A = a_1 a_2 \ldots$ of characters
$a_i \in \{A, C, G, T, N\}$ representing a stretch of contiguous DNA, where A, C, G and T
correspond to the four bases and N stands for “unknown”. An assembly is a contig A
that was obtained from the fragments of some sequencing project using some assembly
algorithm, without elaborating on the details. A run of consecutive N’s represents an
undetermined sequence part, and the number of N’s in the run is sometimes used to
represent its estimated length.

A fragment is a string $F = f_1 f_2 \ldots$ of characters $f_i \in \{A, C, G, T\}$, of length
len(F) usually less than 900. We say that a fragment F hits (or is recruited by) an
assembly A if F globally aligns to A with high identity (e.g. 94% or more). In this
case, we use $s(F, A)$ and $t(F, A)$ to denote the positions in A to which the first character
and last character of F align, respectively. In particular, a fragment aligns in the
forward direction if $s(F, A) < t(F, A)$, whereas the alignment is against the
reverse-complement of F if $s(F, A) > t(F, A)$. For simplicity, we will assume that all s values
are distinct, i.e., $s(F) \neq s(G)$ for any two different fragments that hit A. (In practice,
fragment coordinates do sometimes agree, but our experience is that one can simply
ignore such fragments without a substantial loss of coverage.)

Given a set of fragments $\mathcal{F}$ and an assembly A, we use $\mathcal{F}(A)$ to denote the set of all
fragments in $\mathcal{F}$ that hit A. If an assembly A was obtained by assembling fragments from
a set $\mathcal{F}$, then the set $\mathcal{F}(A)$, and the values of $s(F, A)$ and $t(F, A)$ for all
$F \in \mathcal{F}(A)$, are known. If an assembly A of a chromosome is obtained from one sequencing
project, and the set of fragments $\mathcal{F}$ available was obtained from a different sequencing
project studying the same chromosome, then a fast high-fidelity alignment program [?] can be
used to compute $\mathcal{F}(A)$.

2.1 Fragment-Coverage Plot

Let A be an assembly and $\mathcal{F}(A)$ a set of fragments that hit A. For each fragment
$F \in \mathcal{F}(A)$ define a begin-event $(\min\{s(F, A), t(F, A)\}, +1)$ and an end-event
$(\max\{s(F, A), t(F, A)\}, -1)$. To obtain a fragment-coverage plot for A, consider all
events $(x, e)$ in order of their first coordinate x and, for each begin-event, plot the
number of fragments that span x, given by the number of begin-events minus the number of
end-events seen so far, see Figure 1.

Fig. 1. Fragment-Coverage Plot for a 1 Mb Region of Chromosome 2 of Human [?]. The
assembly A is represented by a line segment [1, len(A)] along the x-axis. The number
of fragments uniquely hitting A is plotted as a function of their position.



A fragment-coverage plot is useful because poorly assembled regions often have 
low fragment-coverage, whereas regions of repetitive sequence can be identified as 
those stretches of sequence that are hit by unusually high numbers of fragments. 

In practice, one can easily accommodate fragments hitting multiple times.
However, for ease of exposition, throughout this paper we will assume that $\mathcal{F}(A)$ is the set
of all fragments that uniquely hit A.
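
The event sweep behind the fragment-coverage plot can be sketched in Python as follows. This is a minimal illustration under our own assumptions, not the original tool: the function name `fragment_coverage` and the input format (a list of (s, t) coordinate pairs) are hypothetical, and the tie-breaking rule (begin-events before end-events at equal coordinates) is a choice the text does not specify.

```python
def fragment_coverage(hits):
    """Return the points (x, coverage) of a fragment-coverage plot.

    hits -- list of (s, t) pairs, the coordinates in the assembly to which
            the first and last character of each fragment align.
    """
    events = []
    for s, t in hits:
        events.append((min(s, t), +1))  # begin-event
        events.append((max(s, t), -1))  # end-event
    # Sort by coordinate; at equal coordinates handle begin-events first
    # (assumption: touching fragments are counted as overlapping).
    events.sort(key=lambda event: (event[0], -event[1]))

    coverage = 0
    points = []
    for x, delta in events:
        coverage += delta
        if delta == +1:
            # One plotted point per begin-event, as described in the text.
            points.append((x, coverage))
    return points


# Two forward fragments and one reverse-complemented fragment.
print(fragment_coverage([(1, 10), (5, 15), (30, 20)]))
# prints [(1, 1), (5, 2), (20, 1)]
```

Sorting dominates the cost, so the sweep takes O(n log n) time for n fragments.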

2.2 Dot-Plot and Line-Plot 

Consider two different assemblies A and B of the same chromosome, and assume that
a set $\mathcal{F}$ of fragments obtained from a shotgun sequencing project for the chromosome
is given. Once we have determined $\mathcal{F}(A)$ and $\mathcal{F}(B)$, how can we visualize this data?

Let $\mathcal{F}(A, B) := \mathcal{F}(A) \cap \mathcal{F}(B)$ denote the set of fragments that hit both
assemblies. A simple dot-plot can be produced by plotting $(x, y)$ with $x := s(F, A)$ and
$y := s(F, B)$ for all $F \in \mathcal{F}(A, B)$, see Figure 2. At higher resolution, plot a line from
$(s(F, A), s(F, B))$ to $(t(F, A), t(F, B))$. Alternatively, represent assemblies A and B by
a line segment from (1, 0) to (len(A), 0) and from (1, 1) to (len(B), 1), respectively. A
simple line-plot showing matching regions of the two assemblies is obtained by drawing
a line segment between $(s(F, A), 0)$ and $(s(F, B), 1)$ for all $F \in \mathcal{F}(A, B)$, see
Figure 3.

If $\mathcal{F}(A)$ is given, but $\mathcal{F}(B)$ is unknown, then a short-cut to recruiting fragments
to B is to compute $\mathcal{F}_A(B) := \{F \in \mathcal{F}(A) \mid F \text{ hits } B\}$ instead of $\mathcal{F}(B)$, at the price
of obtaining a less comprehensive analysis. Alternatively, one could first compare the
consensus sequence of assembly B directly against that of assembly A and then project
fragments from A onto B wherever compatible with the segments of local alignment
between A and B.

Fig. 2. Fragment based dot-plot comparison of two different assemblies of a 6Mb region
of chromosome 2 in human. Each point represents a fragment that hits both assemblies.

Fig. 3. Fragment Based Line-Plot Comparison. Each line segment represents a fragment
that hits both assemblies. Medium grey lines represent fragments contained in the
heaviest common subsequence (HCS) of consistently ordered and oriented segments, light
grey lines represent consistently oriented segments that are not contained in the HCS,
and dark grey lines represent fragments (or segments) that have opposite orientation in
the two assemblies.

2.3 Fragment Segmentation 

For analysis purposes and also to speed up visualization significantly, it is useful to 
segment the fragment matches by determining the maximal consistent and consecutive 
runs of them. 

Consider a fragment $F \in \mathcal{F}(A, B)$. We say that F has preserved orientation if and
only if F has the same orientation in A and B, i.e., if either both $s(F, A) < t(F, A)$
and $s(F, B) < t(F, B)$, or both $s(F, A) > t(F, A)$ and $s(F, B) > t(F, B)$ hold.
Let $\mathcal{F}^+(A, B)$ denote the set of all fragments that have preserved orientation and set
$\mathcal{F}^-(A, B) := \mathcal{F}(A, B) \setminus \mathcal{F}^+(A, B)$.

For any two fragments $F, G \in \mathcal{F}(A, B)$, define $F <_A G$ if $s(F, A) < s(G, A)$,
and define $F <_B G$ if $s(F, B) < s(G, B)$. Because we assume that all s values are
distinct, these are both total orderings, and we use $\mathrm{pred}_A(F)$ and $\mathrm{succ}_A(F)$ to denote
the $<_A$-predecessor and $<_A$-successor of F, respectively; $\mathrm{pred}_B(F)$ and $\mathrm{succ}_B(F)$
are defined analogously.

A sequence $S = (F_1, F_2, \ldots, F_k)$ of fragments is called a matched segment in
either of the two following cases:

1. $\{F_1, F_2, \ldots, F_k\} \subseteq \mathcal{F}^+(A, B)$ and $\mathrm{succ}_A(F_i) = \mathrm{succ}_B(F_i)$ for all
$i = 1, 2, \ldots, k-1$, or

2. $\{F_1, F_2, \ldots, F_k\} \subseteq \mathcal{F}^-(A, B)$ and $\mathrm{succ}_A(F_i) = \mathrm{pred}_B(F_i)$ for all
$i = 1, 2, \ldots, k-1$.

A matched segment is called maximal if it cannot be extended.

Let $\mathcal{S} := \mathcal{S}(\mathcal{F}(A, B)) = \{S_1, S_2, \ldots, S_n\}$ denote the set of all maximal matched
segments of $\mathcal{F}(A, B)$, and let $\mathcal{S}^+$ and $\mathcal{S}^-$ denote the subsets of such segments in cases
1 and 2, respectively. Both $\mathcal{S}^+$ and $\mathcal{S}^-$ can be computed in a simple loop that
considers each fragment in $<_A$ order and decides whether it extends the current segment or
defines the start of a new one.
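
The loop can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `maximal_matched_segments` and the input format (a dictionary mapping a fragment name to its four coordinates) are hypothetical, and we read the definition so that the members of a segment are consecutive in the ordering by s-coordinate in A.

```python
def maximal_matched_segments(frags):
    """Split fragments that hit both assemblies into maximal matched segments.

    frags -- dict: name -> (sA, tA, sB, tB), the s and t coordinates of the
             fragment in assemblies A and B (all s values distinct).
    Returns a list of (sign, [names]) in A order; sign is +1 for preserved
    orientation (case 1) and -1 for reversed orientation (case 2).
    """
    def sign_of(name):
        s_a, t_a, s_b, t_b = frags[name]
        return +1 if (s_a < t_a) == (s_b < t_b) else -1

    order_a = sorted(frags, key=lambda name: frags[name][0])
    order_b = sorted(frags, key=lambda name: frags[name][2])
    rank_b = {name: i for i, name in enumerate(order_b)}

    segments = []
    for name in order_a:
        sign = sign_of(name)
        if segments:
            last_sign, members = segments[-1]
            prev = members[-1]
            # Case 1: the successor in A is also the successor in B.
            # Case 2: the successor in A is the predecessor in B.
            if last_sign == sign and rank_b[name] == rank_b[prev] + sign:
                members.append(name)
                continue
        segments.append((sign, [name]))
    return segments


example = {
    "f1": (10, 500, 110, 600),
    "f2": (300, 800, 400, 900),
    "f3": (1000, 1500, 3000, 2500),  # inverted in B
    "f4": (1400, 1900, 2600, 2100),  # inverted in B
}
print(maximal_matched_segments(example))
# prints [(1, ['f1', 'f2']), (-1, ['f3', 'f4'])]
```

After the two sorts, a single pass over the fragments suffices, because a fragment can only extend the segment that ends at its predecessor in A.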

The A-support of a matched segment $S = (F_1, F_2, \ldots, F_k)$ is defined as the
interval $[s(S, A), t(S, A)]$, with $s(S, A) := \min_{F \in S}(s(F, A), t(F, A))$ and
$t(S, A) := \max_{F \in S}(s(F, A), t(F, A))$. The B-support is defined similarly. Let len(S) denote the
minimum length of the A- and B-supports of S.

2.4 Heaviest Common Subsequence 

Suppose we are given two orderings $O_1$ and $O_2$ of the set of numbers $\{1, 2, \ldots, n\}$ (for some
fixed number n) and a weight function $w : \{1, 2, \ldots, n\} \to \mathbb{N}^{\geq 0}$. A subsequence
$H := H(O_1, O_2, w)$ of both orderings is called a heaviest common subsequence if it has
maximal weight $w(H) := \sum_{i \in H} w(i)$. A heaviest common subsequence can be
computed in $O(n \log n)$ time and space, see [3].

For $\mathcal{S} = \{S_1, S_2, \ldots, S_n\}$, let $O_1$ and $O_2$ denote the orderings of the indices
$1, 2, \ldots, n$ induced by the orderings of $\mathcal{S}$ defined by $s(\cdot, A)$ and $s(\cdot, B)$, respectively. With
weight function $w(i) := \mathrm{len}(S_i)$, compute the heaviest common subsequence H of $O_1$
and $O_2$.
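
As an illustration, a heaviest common subsequence of two orderings of the same items can be computed as a heaviest increasing subsequence after mapping each item to its position in the second ordering. The Python sketch below uses a Fenwick tree for prefix maxima to reach O(n log n) time; it is our own sketch and not necessarily the algorithm of [3], and the function name and input format are hypothetical.

```python
def heaviest_common_subsequence(o1, o2, w):
    """Return (weight, items) of a heaviest common subsequence.

    o1, o2 -- two orderings (lists) of the same distinct items
    w      -- dict: item -> non-negative weight
    """
    n = len(o1)
    if n == 0:
        return 0, []
    pos2 = {item: i + 1 for i, item in enumerate(o2)}  # 1-based positions
    # tree[q] holds (best weight, index in o1) for a block of o2 positions.
    tree = [(0, -1)] * (n + 1)
    best = [0] * n
    parent = [-1] * n

    for i, item in enumerate(o1):
        # Best chain ending strictly before this item's position in o2.
        q = pos2[item] - 1
        current = (0, -1)
        while q > 0:
            if tree[q][0] > current[0]:
                current = tree[q]
            q -= q & -q
        best[i] = current[0] + w[item]
        parent[i] = current[1]
        # Record the new chain for later queries.
        q = pos2[item]
        while q <= n:
            if best[i] > tree[q][0]:
                tree[q] = (best[i], i)
            q += q & -q

    end = max(range(n), key=best.__getitem__)
    chain = []
    i = end
    while i != -1:
        chain.append(o1[i])
        i = parent[i]
    chain.reverse()
    return best[end], chain


print(heaviest_common_subsequence([1, 2, 3, 4], [2, 1, 3, 4],
                                  {1: 5, 2: 1, 3: 1, 4: 1}))
# prints (7, [1, 3, 4])
```

In our setting the items would be the indices of the maximal matched segments, ordered by $s(\cdot, A)$ and $s(\cdot, B)$, with weights $\mathrm{len}(S_i)$.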

We call $\mathcal{H} := \{S_i \in \mathcal{S} \mid i \in H\}$ the heaviest common subsequence of matched
segments. We can distinguish between four categories of matched segments:

1. $\mathcal{S}^+ \cap \mathcal{H}$ is the set of segments that have the same ordering and orientation in both
assemblies,

2. $\mathcal{S}^- \cap \mathcal{H}$ is the set of segments that have the same position in both assemblies, but
are inverted with respect to each other,

3. $\mathcal{S}^+ \setminus \mathcal{H}$ is the set of segments that have transposed positions, and

4. $\mathcal{S}^- \setminus \mathcal{H}$ is the set of segments that appear both transposed and inverted.

The amount of sequence contained in each of these four categories is a good
measure of how similar two assemblies are. In visualization, using different colors for each
of them significantly enhances the dot-plot and line-plot representations described above,
see Figure 3.

2.5 Segment Displacement

Consider two segments $S = (F_1, F_2, \ldots)$ and $T = (G_1, G_2, \ldots)$. We say that S and
T are parallel if either both $s(F_1, A) < s(G_1, A)$ and $s(F_1, B) < s(G_1, B)$, or both
$s(F_1, A) > s(G_1, A)$ and $s(F_1, B) > s(G_1, B)$ hold.

It seems reasonable to “trust” those portions of the two assemblies that are covered
by segments from the heaviest common subsequence $\mathcal{H}$. Thus, we propose to measure
the amount by which the positioning of a segment S not in $\mathcal{S}^+ \cap \mathcal{H}$ differs in the two
assemblies as follows: We define the displacement D(S) associated with S as the sum
of lengths of all segments in $\mathcal{H}$ that are not parallel to S. In Figure 4 we plot segment
length vs. segment displacement.
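
A minimal Python sketch of the displacement computation follows. It rests on two assumptions of ours that go beyond the text: the sum ranges over the segments of the heaviest common subsequence, and the segment itself is skipped if it belongs to that set. The function names and the tuple format (start in A, start in B, length) are hypothetical.

```python
def parallel(seg, other):
    """True if two segments appear in the same relative order in A and B.

    Each segment is (start_in_A, start_in_B, length), where the starts are
    the s-coordinates of its first fragment in the two assemblies.
    """
    return ((seg[0] < other[0] and seg[1] < other[1])
            or (seg[0] > other[0] and seg[1] > other[1]))


def displacement(seg, trusted):
    """Sum of lengths of trusted segments that are not parallel to seg.

    trusted -- segments of the heaviest common subsequence (assumption);
               seg itself is skipped if it occurs in the list (assumption).
    """
    return sum(other[2] for other in trusted
               if other[:2] != seg[:2] and not parallel(seg, other))


trusted = [(0, 0, 100), (200, 200, 50), (400, 400, 80)]
moved = (500, 100, 30)  # placed late in A but early in B
print(displacement(moved, trusted))  # prints 130
```

In the example the moved segment has jumped over the trusted segments of length 50 and 80, so its placement in the two assemblies differs by at least 130 bp.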



Fig. 4. Scatter-Plot Comparison of Two Assemblies: a dot $(x, y)$ represents a sequence
segment S of length len(S) = x whose displacement D(S) is y. In other words, the
placement of S in the two assemblies differs by at least D(S) bp. Note that points along
the x-axis correspond to in-place inversions.



3 Mate-Pair-Based Evaluation Methods 

Let A and B be two assemblies of a chromosome and let $\mathcal{F}$ be a set of associated
fragments. Assume now that the fragments in $\mathcal{F}$ were generated using a “double-barreled”
shotgun protocol in which mate-pairs of fragments are obtained by reading both ends
of longer clones. For purposes of this paper, a mate-pair library $M = (L, \mu, \sigma)$ consists
of a list L of pairs of mated fragments, together with a mean estimate $\mu$ and standard
deviation $\sigma$ for the length of the clones from which the mate-pairs were obtained, see
Figure 5.

Typical clone sizes used to produce mate-pair libraries used in Celera’s human
genome sequencing were 2kb, 10kb, 50kb, and 150kb. The quality of shorter mate-pairs
can be very good with a standard deviation of about 10% of the mean length, whereas
the standard deviation can reach 20% for long clones. Also, because both ends of clones
are read in separate sequencing reactions, there is a potential for mis-associating mates.
However, a high level of automation and electronic sample tracking can reduce the
occurrences of this problem to below 1%. By construction, any fragment will occur in at
most one mate-pair.

Fig. 5. Two fragments F and G that form a mate-pair with known mean distance $\mu$ and
standard deviation $\sigma$. Note their relative orientation in the source sequence.

Given an assembly A with fragments $\mathcal{F}(A)$ and a collection of mate-pair libraries
$\mathcal{M} = \{M_1, M_2, \ldots\}$, let $m = \{F, G\} \subseteq \mathcal{F}(A)$ be a mate-pair occurring in some
library $M_i = (L, \mu, \sigma)$. Then m is called happy if the positioning of F and G in A
is reasonable, i.e., if F and G are oriented towards each other (as in Figure 5) and
$\big|\,|s(F, A) - s(G, A)| - \mu\,\big| < 3\sigma$, say. An unhappy mate-pair m is called mis-oriented if
the former condition is not satisfied, and mis-separated if only the latter condition fails.
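
The classification can be sketched in Python as follows. This is an illustration only; the function name `classify_mate_pair` and the representation of a fragment by its (s, t) coordinates in A are our assumptions. Fragments are ordered by their s-coordinate, so that “oriented towards each other” means the left one points right and the right one points left.

```python
def classify_mate_pair(f, g, mu, sigma):
    """Classify a mate-pair as 'happy', 'mis-separated' or 'mis-oriented'.

    f, g  -- (s, t) coordinates of the two mated fragments in the assembly
    mu    -- mean clone length of the library
    sigma -- standard deviation of the clone length
    """
    # Order the two fragments by their s-coordinate.
    (s_left, t_left), (s_right, t_right) = sorted((f, g))
    towards_each_other = s_left < t_left and s_right > t_right
    if not towards_each_other:
        return "mis-oriented"
    if abs(abs(s_left - s_right) - mu) < 3 * sigma:
        return "happy"
    return "mis-separated"


# Hypothetical 2kb library with a standard deviation of 200bp.
print(classify_mate_pair((1000, 1500), (3000, 2500), 2000, 200))  # happy
print(classify_mate_pair((1000, 1500), (6000, 5500), 2000, 200))  # mis-separated
print(classify_mate_pair((1000, 1500), (2500, 3000), 2000, 200))  # mis-oriented
```

Orientation is tested first, so a pair that fails both conditions is reported as mis-oriented, in line with the definition above.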

3.1 Clone-Middle Plot 

We obtain a clone-middle plot for A as follows: For each pair of fragments
$F, G \in \mathcal{F}(A)$ that occurs in a mate-pair library M, draw a line segment from $(t(F, A), y)$ to
$(t(G, A), y)$, where $y \in [0, 1]$ is a randomly chosen height. Lines can be shown in
different colors depending on whether the corresponding mate-pair is happy, mis-separated
or mis-oriented, see Figure 6 and also Figure 6 in [7]. The interval $[t(F, A), t(G, A)]$
(assuming w.l.o.g. $t(F, A) < t(G, A)$) is called the clone-middle (in A) associated with
the pair F, G.

One drawback of this visualization for large assemblies is that substantially
misplaced pairs give rise to very long lines in the plot and obscure the view of local regions.
To address this, we introduce the localized clone-middle plot (see Figure 7): Let $\{F, G\}$
be a mis-separated or mis-oriented mate from some library $M = (L, \mu, \sigma)$. Assume
w.l.o.g. that $s(F, A) < s(G, A)$. Represent the mate-pair by a line that indicates the
range in which F expects to see G, i.e., by drawing a line segment from $t(F, A)$ of
length $\mu + 3\sigma - (\mathrm{len}(F) + \mathrm{len}(G))$ towards the right, if $s(F, A) < t(F, A)$, and to the
left, otherwise. As above, define the clone-middle accordingly.

Mis-separated and mis-oriented mate-pairs indicate discrepancies between a given 
assembly and the original source sequence or chromosome, as follows. 

3.2 Breakpoint Detection 

Loosely speaking, a breakpoint of an assembly A is a position p in A such that the 
sequence immediately to the left and right of p in A comes from two separate regions 
of the source sequence. 

Let $m = \{F, G\}$ be a mis-oriented mate-pair such that $s(F, A) < s(G, A)$. We
distinguish between three different cases: normal-oriented: both fragments are oriented
to the right; anti-oriented: both are oriented to the left; and outtie-oriented: F is oriented
to the left and G is oriented to the right. (Happy and mis-separated mates are
innie-oriented.)

Fig. 6. Clone-Middle Diagram for Assemblies A and B. Each mate-pair m is
represented by a horizontal line segment joining its two fragments, if m is mis-separated
(shown in light grey) or mis-oriented (shown in dark grey). Happy mates are not shown.
Mate-pairs are grouped by “library”, labeled 2K, 10K and 50K. Ticks along the axis
indicate putative breakpoints, as inferred from the mis-oriented mates.

We now describe a simple but effective heuristic for detecting breakpoints. Choose
a threshold T > 0, depending on details of the sequencing project. (All figures in this
paper were produced using T = 5.) An event is a three-tuple $(x, t, a)$ consisting of a
coordinate $x \in \{1, \ldots, \mathrm{len}(A)\}$, a type $t \in \{$normal, anti, outtie, mis-separated$\}$, and
an “action” $a \in \{+1, -1\}$, where $+1$ or $-1$ indicates the beginning or end of a
clone-middle, respectively. We maintain the number $V(t)$ of currently alive mates of type
t. For each event $e = (x, t, a)$ in ascending order of coordinate x: If $a = +1$, then
increment $V(t)$ by 1. In the other case ($a = -1$), if $V(t) > T$, then report a breakpoint
at position x and set $V(t) = 0$, else decrease $V(t)$ by 1. (For a better estimation of the
true position of the breakpoint, report the interval $[x', x]$, where $x'$ is the coordinate of
the most recent alive $+1$-event of type t.) Breakpoints estimated in this way are shown
in Figure 6.
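
The heuristic can be sketched in Python as follows. It is a minimal illustration, not our implementation: the function name `detect_breakpoints` and the event encoding are hypothetical, and clamping the counter at zero after a reset (so that the end-events of mates that were alive at the breakpoint do not drive it negative) is our assumption.

```python
from collections import defaultdict


def detect_breakpoints(events, threshold=5):
    """Report putative breakpoints from clone-middle events.

    events    -- iterable of (x, kind, action) with action +1 (begin of a
                 clone-middle) or -1 (end), kind one of 'normal', 'anti',
                 'outtie', 'mis-separated'
    threshold -- the threshold T of the heuristic
    Returns a list of (x, kind) pairs.
    """
    alive = defaultdict(int)  # V(t): currently alive mates per type
    breakpoints = []
    for x, kind, action in sorted(events, key=lambda event: event[0]):
        if action == +1:
            alive[kind] += 1
        elif alive[kind] > threshold:
            breakpoints.append((x, kind))
            alive[kind] = 0
        else:
            # Assumption: never let the counter drop below zero.
            alive[kind] = max(0, alive[kind] - 1)
    return breakpoints


# Six overlapping outtie clone-middles [i, 100 + i] exceed T = 5.
events = []
for i in range(6):
    events.append((i, "outtie", +1))
    events.append((100 + i, "outtie", -1))
print(detect_breakpoints(events, threshold=5))  # prints [(100, 'outtie')]
```

A single breakpoint is reported where the first of the six clone-middles ends; the remaining end-events find the counter already reset.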

A useful variant of the breakpoint estimator is obtained by taking the current number 
of alive happy mates into account: Scanning from left to right, a breakpoint is said to 
be present at position x if there exists an event e = (x, t, —1) such that the number of 
alive unhappy mates of type t exceeds the number of alive happy mates of type t.

Fig. 7. A Localized Clone-Middle Diagram for Assemblies A and B. Here, each
mis-separated or mis-oriented mate-pair is represented by a line that indicates the expected
range of placement of the right mate with respect to the left one. Ticks along the axis
indicate putative breakpoints, as inferred from the mis-oriented mates.



3.3 Clone-Coverage Plot 

Similar to the fragment-coverage plot discussed in Section 2.1, one can use the
clone-coverage events to compute a clone-coverage plot for each of the types of mate-pairs,
see Figure 8.

Note that the simultaneous occurrence of both high happy and high mis-separated 
coverage may indicate the presence of a polymorphism in the fragment data. 

3.4 Synthesis 

Combining all the described methods into one view gives rise to a tool that is very
helpful in deciding by how much two different assemblies differ and, moreover, which one is
more compatible with the given fragment and mate-pair data; see Figure 9. This latter
capability is an especially powerful aspect of analysis in terms of fragments and
mate-pairs.



4 Some Applications 

The techniques described in this paper have a number of different applications in
comparative genomics. Originally, our goal was to design a tool for comparing the
similarities and differences of assemblies of human chromosomes produced at Celera with
those produced by the publicly funded Human Genome Project (PFP). A detailed
comparison based on our methods is shown in Figures 6 and 7 of [7]. As an example, we
show the comparison for chromosome 2 in Figure 10. For clarity, only segments of
length 50kb or more are shown.

Fig. 8. Clone-coverage plot for assemblies A and B, showing the number of happy
mate-pairs (medium grey), mis-separated pairs (light grey) and mis-oriented ones (dark
grey).

Fig. 9. A combined line-plot, clone-middle, clone-coverage and breakpoint view of the
two assemblies A and B indicates that assembly A is significantly more compatible
with the given fragment and mate-pair data than assembly B is.



Fig. 10. Line-plot and breakpoint comparison of two different assemblies of
chromosome 2 of human. Assembly C was produced at Celera [?] and assembly H was
produced in the context of the publicly funded Human Genome Project and was released
on September 5, 2000. The number of detected breakpoints (indicated as ticks along
the chromosome axes) is 73 for C and 3592 for H. The horizontal axis runs from 0 to 200 Mb.



4.1 Feature-Tracking 

A second application is in tracking forward features from one version of an assembly
to the next. To illustrate this, we consider two assemblies of chromosome 19 produced
in the context of the PFP from publicly available data. Assembly $H_1$ was released on
September 5, 2000 and assembly $H_2$ was released on January 9, 2001 [?].

How much did the assembly change and did it improve? The line-plot comparison of
$H_1$ and $H_2$ in Figure 11 indicates that many local changes have taken place. A detailed
analysis (not reported here) shows that many changes are due to a change of
orientation of so-called “supercontigs” in the assembly. The number of detected breakpoints
dropped from 723 to 488.

Fig. 11. Line-plot, clone-middle and breakpoint comparison of the PFP assembly $H_1$
of chromosome 19 as of September 5, 2000, and a more recent PFP assembly $H_2$
dating from January 9, 2001.



4.2 Comparison of Different Chromosomes 

Additionally, our algorithms can be used to compare different chromosomes of the same 
species e.g. in search of duplication events, but also to compare different chromosomes 
from different species, in the latter case using a lower stringency alignment method to 
define fragment hits. 

We illustrate this by a comparison of chromosome X and Y of human, as described 
in m. In this analysis we use only uniquely hitting fragments. In summary, we see 
approximately 1.3Mb of sequence in conserved segments, of which 164kb are contained 
in the heaviest common subsequence (relative to the standard orientation of X and Y), 
82kb are contained in other segments of the same orientation and 1.05Mb in oppositely 
oriented segments; see Figure 12. We observe orientation-preserving similarity at both
ends of the chromosomes and a large inverted conserved segment in the interior of X.
Fig. 12. A line-plot comparison of chromosome X vs. Y of human showing segments
of highly conserved non-repetitive sequence.
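The "heaviest common subsequence" used in the X/Y summary above can be illustrated with a small sketch. It assumes each fragment hit is reduced to a triple (position on the first chromosome, position on the second chromosome, weight in base pairs) and that all hits share one orientation; the function name, the tuple layout and the simple quadratic dynamic program are illustrative choices, not the implementation used for the reported numbers. Faster algorithms for this problem are the subject of reference [3].

```python
from typing import List, Tuple

# Hypothetical representation of a fragment hit:
# (position on chromosome A, position on chromosome B, weight in bp).
Match = Tuple[int, int, int]


def heaviest_increasing_subsequence(matches: List[Match]) -> Tuple[int, List[Match]]:
    """Return (total weight, chain) for the heaviest chain of matches whose
    positions are strictly increasing on both chromosomes.

    Plain O(n^2) dynamic program, chosen for clarity rather than speed.
    """
    ordered = sorted(matches)  # sort by position on chromosome A
    n = len(ordered)
    if n == 0:
        return 0, []

    best = [m[2] for m in ordered]  # best[i]: heaviest chain ending at match i
    prev = [-1] * n                 # back-pointers for chain reconstruction

    for i in range(n):
        for j in range(i):
            increasing = ordered[j][0] < ordered[i][0] and ordered[j][1] < ordered[i][1]
            if increasing and best[j] + ordered[i][2] > best[i]:
                best[i] = best[j] + ordered[i][2]
                prev[i] = j

    # Walk back from the heaviest chain end to recover the chain itself.
    end = max(range(n), key=best.__getitem__)
    chain = []
    i = end
    while i != -1:
        chain.append(ordered[i])
        i = prev[i]
    chain.reverse()
    return best[end], chain


hits = [(10, 50, 5), (20, 10, 3), (30, 20, 4), (40, 60, 2), (50, 30, 6)]
weight, chain = heaviest_increasing_subsequence(hits)
print(weight, chain)
```

For these toy hits the heaviest chain has weight 13 and consists of the hits at (20, 10), (30, 20) and (50, 30). Hits outside the chain correspond to the "other segments" in the summary; oppositely oriented hits would have to be handled separately, for example by reversing the coordinates of one chromosome before chaining.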



3. G. Jacobson and K.-P. Vo. Heaviest increasing/common subsequence problems. In Proceedings
of the 3rd Annual Symposium on Combinatorial Pattern Matching (CPM), pages 52-66,
1992.

4. E. W. Myers, G. G. Sutton, A. L. Delcher, I. M. Dew, D. P. Fasulo, M. J. Flanigan, S. A. 
Kravitz, C. M. Mobarry, K. H. J. Reinert, K. A. Remington, E. L. Anson, R. A. Bolanos, H.-H.
Chou, C. M. Jordan, A. E. Halpern, S. Lonardi, E. M. Beasley, R. C. Brandon, L. Chen, P. J.
Dunn, Z. Lai, Y. Liang, D. R. Nusskern, M. Zhan, Q. Zhang, X. Zheng, G. M. Rubin, M. D.
Adams, and J. C. Venter. A whole-genome assembly of Drosophila. Science, 287:2196-2204,
2000.

5. F. Sanger, A. R. Coulson, G. F. Hong, D. F. Hill, and G. B. Petersen. Nucleotide sequence of
bacteriophage λ DNA. J. Mol. Biol., 162(4):729-773, 1982.

6. F. Sanger, S. Nicklen, and A. R. Coulson. DNA sequencing with chain-terminating inhibitors. 
Proceedings of the National Academy of Sciences, 74(12):5463-5467, 1977.

7. J. C. Venter, M. D. Adams, E. W. Myers, et al. The sequence of the human genome. Science, 
291:1145-1434, 2001. 



Author Index 



Anantharaman, Thomas, 27 

Barash, Yoseph, 278 
Bejerano, Gill, 278 
Bergeron, Anne, 164 
Bouchez, Martin, 41 

Caprara, Alberto, 238 
Casey, Will, 52 
Chabrier, Patrick, 41 
Chor, Benny, 204 

Denise, Alain, 85 

Edelsbrunner, Herbert, 112
Elofsson, Arne, 128 
Eriksen, Niklas, 227 
Eriksson, Olivia, 128 

Ganapathysaravanabavan, Ganeshkumar, 156
Gilbert, David, 98 
Graur, Dan, 142 

Friedman, Nir, 278 

Halpern, Aaron L., 294 
Hasegawa, Masami, 142 
Heber, Steffen, 252 
Hellendoorn, J., 264 
Hendy, Michael, 204 
Huson, Daniel H., 294 

Jagota, Arun, 69 

Lai, Zhongwu, 294
Lyngsø, Rune B., 69

Meloen, R.H., 264 
Miklos, Istvan, 1 
Milan, Denis, 41 
Mishra, Bud, 27, 52 



Moret, Bernard M.E., 189 
Myers, Eugene W., 294 

Nakhleh, Luay, 214

Ohsen, Niklas von, 11 

Pedersen, Christian N.S., 69 
Penny, David, 204 
Pupko, Tal, 142 

Regnier, Mireille, 85 
Reinders, M.J.T., 264 
Reinert, Knut, 294 
Roshan, Usman, 214 

Schiex, Thomas, 41 
Shamir, Ron, 142 
Sharan, Roded, 142 
Siepel, Adam C., 189 
Slootstra, J.W., 264 
St.John, Katherine, 214 
Stoye, Jens, 252 
Strasbourg, François, 164
Sun, Jerry, 214 
Sutton, Granger G., 294 

Toroczkai, Zoltan, 1 

Vandenbogaert, Mathias, 85 
Veen, Peter J. van der, 264 
Vīksna, Juris, 98

Wang, Li-San, 175 
Warnow, Tandy, 156, 214 
Wessels, L.F.A., 264 
Wigler, Mike, 52 

Zhou, Yishao, 128 
Zimmer, Ralf, 11
Zomorodian, Afra, 112 




